Can we make OTLP exports failing an optional thing to log? Current istio-proxy logs with Dynatrace are super spammy #35997
Per our conversation with our Dynatrace support rep:

> Our devs have identified some notable behavior. As mentioned before, on the cluster/ActiveGate side we were not seeing 503s as reported by the OTel exporter. This was the main source of confusion, but while 503s were not present, we were seeing large spikes in 429 errors. This throttling error lines up far more with the Envoy-side spam. The reason the 429s appear as 503s seems to be an Envoy/OTel quirk. Searching online has turned up similar cases where, when the export endpoint stops responding, a 503 appears rather than the expected status. Unfortunately there is not much information on why this is, but it is on the Envoy side and not a large focus. Would you be able to set up the exporter based on the previously shared documentation? This should help provide more robust artifacts showing where the traffic is coming from. In addition, I would pursue the Envoy/OTel side further via a GitHub issue, as there seems to be problematic behavior with the tracing.

Even after killing our Ingress object, pod monitors, and readiness probe, thus removing any sources of traffic to this test app, we STILL get constant 503s. So what traces is Envoy generating for an app with no ingress/egress traffic? The istio-proxy logs are completely blank except for these nonstop 503s. Are we sure there isn't a problem here?
To add, what I observed and was able to reproduce: I configured Envoy to export traces to a "dummy" OTLP backend that always returned HTTP 429, and Envoy somehow interprets that as 503s and seems to constantly retry sending. I sent a single request to Envoy, but the Envoy logs fill up with these non-stop.
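A "dummy" backend like the one described can be reproduced in a few lines of Python; the class name and port are illustrative, not anything Envoy-specific:

```python
# Minimal OTLP/HTTP backend that always answers 429, mimicking a
# rate-limiting collector like the one described above.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class AlwaysThrottle(BaseHTTPRequestHandler):
    def do_POST(self):
        # Drain the request body so the client connection stays healthy.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(429)  # Too Many Requests
        self.send_header("Retry-After", "60")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the reproduction quiet


def serve(port: int = 4318) -> HTTPServer:
    """Start the throttling server in the background and return it."""
    srv = HTTPServer(("127.0.0.1", port), AlwaysThrottle)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

Pointing an OTLP/HTTP exporter at this endpoint should reproduce the constant-retry behaviour without involving a real Dynatrace tenant.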
Actually, I think I found what it is :D I ran a bare-bones Envoy and, while debugging, noticed it was always calling the exporter, but the OTLP request is empty: there's just a resource there, and it sends those non-stop. I pointed Envoy to webhook.site and immediately got 429s. The problem is that we still export when there are no spans; it simply exports whenever an exporter is configured. So the fix here should be to only export when there's an exporter AND there are spans to export. @jbilliau-rcd I think this is why you run into the Dynatrace 429s. I also saw that by default the export "timer" fires every 5ms? Not sure what to make of this line: https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/opentelemetry/tracer.cc#L161. I will send a PR to address this.
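The fix described above can be sketched like this; `Tracer`, `flush`, and `span_buffer` are illustrative Python names, not Envoy's actual C++ identifiers:

```python
# Sketch of the fix: flush only when there is both an exporter configured
# AND at least one buffered span, instead of on every timer tick.
from typing import Callable, List, Optional


class Tracer:
    def __init__(self, exporter: Optional[Callable[[List[str]], None]] = None):
        self.exporter = exporter
        self.span_buffer: List[str] = []
        self.export_calls = 0

    def flush(self) -> bool:
        # Before the fix, only the exporter was checked, so an empty OTLP
        # request (resource only, no spans) went out on every interval.
        if self.exporter is None or not self.span_buffer:
            return False
        self.exporter(list(self.span_buffer))
        self.export_calls += 1
        self.span_buffer.clear()
        return True
```

With this guard, a timer firing every 5ms on an idle proxy produces zero outbound requests instead of a constant stream of empty OTLP messages.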
@joaopgrassi one last question for you: if we were to proceed with the Istio 1.22 rollout in spite of these errors, would we have issues reporting spans/traces to Dynatrace? Meaning, if Envoy is constantly trying to export spans, even when empty, it results in Dynatrace returning 429s, right? Would that then cause the ACTUAL spans/traces we want from real calls to also get 429s, since DT is already rate-limiting us because of the garbage empty ones?
Yes, exactly. The "empty" OTLP messages are still valid, so they are still accepted by Dynatrace and thus consume your tenant's quota. I don't know off the top of my head, but it may be possible to filter those out if you export to an OTel Collector first, and then from the Collector to Dynatrace. If that's not an option, then I'm afraid you would need to wait for a release of the updated Istio. What we can do is see if we can backport the fix to the Envoy 1.30/1.31 releases and let Istio pick that up; then you use the new Istio patch release.
…oxy#36313)
Commit Message: opentelemetrytracer: avoid exporting when there are no spans
Additional Description: Today, the OpenTelemetry tracer exports an OTLP request on each interval, even when there are no spans to be sent. The OTLP request is empty, only containing the resource (and its attributes).
Risk Level: Low
Testing: Manual
Docs Changes: N/A
Release Notes: N/A
Fixes envoyproxy#35997
Signed-off-by: Joao Grassi <5938087+joaopgrassi@users.noreply.github.com>
I submitted backports for this issue for versions 1.30 and 1.31: #36405, #36404. One thing I saw is that Envoy seems to have a way to configure the export timeout. So, in theory, you could increase that to, say, 1 minute or more. That can help because it won't export as often, increasing the chance of actually having real spans in each request. I think this might help reduce the number of requests and Dynatrace throttling them. The problem is that I don't think this can be enabled via Istio. Maybe via an EnvoyFilter? But that can be dangerous 🤔
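As a back-of-envelope check (assuming one OTLP request per timer tick, matching the behaviour described above), here is how many requests per minute each flush interval implies; the function name is illustrative:

```python
# Requests per minute implied by a given flush interval, assuming one
# OTLP export request per timer tick.
def requests_per_minute(flush_interval_ms: float) -> float:
    return 60_000 / flush_interval_ms


# 5 ms (the default observed above) vs a 60 s interval:
# requests_per_minute(5)       -> 12000.0
# requests_per_minute(60_000)  -> 1.0
```

Going from 5ms to 1 minute cuts the request rate by a factor of 12,000, which is why a longer interval alone can already get you under a backend's rate limit.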
@jbilliau-rcd the backports I opened yesterday are already merged, and a new Istio version will probably be released soon to address this. I understand the frustration, of course. Lots of folks run the Collector in between, and then batching works better; and you would be surprised how long organizations take to upgrade software, especially crucial pieces like Istio/Envoy. This bug has been here since the OTel tracer was introduced, which is years by now. Trying to make things better, I investigated more and found a temporary workaround you can use to reduce the number of requests going out from Envoy and hopefully avoid Dynatrace's throttling. Envoy has the concept of runtime configuration, and that can be changed via Istio by deploying a ConfigMap with a custom bootstrap configuration.
Then you add annotations to your pod definitions, which tells Istio to use this bootstrap config when injecting the Envoy proxies:

```yaml
template:
  metadata:
    labels:
      app: liveness-http
      version: v1
    annotations:
      sidecar.istio.io/bootstrapOverride: "istio-custom-bootstrap-config" # ADD THIS ANNOTATION
```

The important thing is that the value of the annotation matches the name of the ConfigMap. This changes the OTel exporter to export every minute OR when Envoy has 512 spans waiting to be exported. This should reduce the number of requests Envoy makes, and each one will then contain real spans, as 1 minute should be enough time for real load to generate spans. You can tweak this to your needs as well, e.g. increase the export timeout even more, to, say, 5 minutes; it highly depends on the load you expect and the number of spans that will be generated. Once the backports are published, you can stop using this. Let me know if it works for you. If you still run into request throttling with this, please reach out to support and we can try increasing your limits as well.
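The batching behaviour this override aims for can be sketched as follows; `BatchTrigger` and its parameter names are illustrative Python, not Envoy's actual identifiers:

```python
# Export when either the flush interval has elapsed or the span buffer
# has reached the batch size (60 s / 512 spans in the config above),
# and never when the buffer is empty.
class BatchTrigger:
    def __init__(self, flush_interval_s: float = 60.0, max_batch: int = 512):
        self.flush_interval_s = flush_interval_s
        self.max_batch = max_batch

    def should_flush(self, buffered_spans: int, elapsed_s: float) -> bool:
        if buffered_spans == 0:
            return False  # never send empty OTLP requests
        return (buffered_spans >= self.max_batch
                or elapsed_s >= self.flush_interval_s)
```

The two thresholds trade latency for request volume: a busy proxy flushes as soon as it accumulates a full batch, while a quiet one waits out the interval and sends whatever it has.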
I will open separate issues to allow configuring both settings.
So if this configures the OTel exporter to export every minute OR when it has 512 spans, that would completely eliminate empty exports, correct? Heck, even if I tell it to wait until it has 2 spans, that's still better than the empty ones. This is good stuff; thanks for all the deep diving into this. It's been a huge help and the only thing lifting the block we've had on this migration for 2 months now.
Not quite.
The default for
Both of those settings sound really low; the defaults should be much higher, or better yet, configurable (ideally through Istio, so we aren't writing custom ConfigMaps and annotations all over the place). Not trying to give you another PR to do, just thinking out loud. As per my response to you on another issue, we are going to go down the route of the OTel Collector first, since deploying annotations and ConfigMaps to thousands of Deployments across hundreds of clusters isn't the most feasible option (hence my first sentence).
Currently, Envoy hardcodes that a failing OTLP export must be written to ENVOY_LOG and thus appears in the log. We are currently piloting Istio 1.22 with Dynatrace and constantly see this in the logs:
```
2024-09-05T18:28:10.787073363Z 2024-09-05T18:28:10.786914Z error envoy tracing external/envoy/source/extensions/tracers/opentelemetry/http_trace_exporter.cc:86 OTLP HTTP exporter received a non-success status code: 503 while exporting the OTLP message thread=23
```
This is being produced by this block of code: `envoy/source/extensions/tracers/opentelemetry/http_trace_exporter.cc`, lines 81 to 92 at a65765e.
PROPOSED FIX
Since Istio currently hardcodes a sample rate of 100 when using a custom OTel provider (another issue I need to raise on their GitHub), and Dynatrace rate-limits OTLP exports, parsing Envoy logs becomes very difficult: roughly 70% of the log lines are this 503 message. It would be nice if this could take a variable or argument to optionally "fail silent", since I don't really care if OTel traces fail when sending to Dynatrace.
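One possible shape for such a knob, sketched in Python rather than Envoy's C++ (`log_failures` is a hypothetical option name, not an existing Envoy setting): export failures are downgraded from `error` to `debug` so they can be filtered out of normal log output.

```python
# Sketch of a "fail silent" option for an OTLP exporter's response
# handler. When log_failures is False, non-success status codes are
# logged at DEBUG instead of ERROR, so default log levels hide them.
import logging

log = logging.getLogger("otlp_exporter")


def on_export_response(status_code: int, log_failures: bool = True) -> None:
    if 200 <= status_code < 300:
        return  # successful export: nothing to log
    level = logging.ERROR if log_failures else logging.DEBUG
    log.log(level,
            "OTLP HTTP exporter received a non-success status code: %d",
            status_code)
```

Keeping the message at DEBUG rather than dropping it entirely preserves the signal for troubleshooting while keeping production logs parseable.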