
Can we make OTLP exports failing an optional thing to log? Current istio-proxy logs with Dynatrace are super spammy #35997

Closed
jbilliau-rcd opened this issue Sep 5, 2024 · 12 comments · Fixed by #36313
Labels
area/opentelemetry, enhancement

Comments

@jbilliau-rcd

Currently, Envoy hardcodes that a failing OTLP export is logged at error level via ENVOY_LOG, and thus shows up in the proxy log. We are piloting Istio 1.22 with Dynatrace and are constantly seeing this in the logs:

2024-09-05T18:28:10.787073363Z 2024-09-05T18:28:10.786914Z error envoy tracing external/envoy/source/extensions/tracers/opentelemetry/http_trace_exporter.cc:86 OTLP HTTP exporter received a non-success status code: 503 while exporting the OTLP message thread=23

This is being produced by this block of code -

void OpenTelemetryHttpTraceExporter::onSuccess(const Http::AsyncClient::Request& request,
                                               Http::ResponseMessagePtr&& http_response) {
  active_requests_.remove(request);
  const auto response_code = Http::Utility::getResponseStatus(http_response->headers());
  if (response_code != enumToInt(Http::Code::OK)) {
    ENVOY_LOG(error,
              "OTLP HTTP exporter received a non-success status code: {} while exporting the OTLP "
              "message",
              response_code);
  }
}

PROPOSED FIX

Since Istio currently hardcodes a sample rate of 100 when using a custom OTel provider (another issue I need to raise on their GitHub), and Dynatrace rate-limits OTLP exports, parsing Envoy logs becomes very difficult: roughly 70% of the log lines are this 503 message. It would be nice to have a variable or argument to optionally "fail silent", since I don't really care if OTel traces fail when sending to Dynatrace.
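
For illustration, a minimal sketch of what such an opt-out could look like in the exporter's response handler; the log_export_failures_ member and its wiring to configuration are hypothetical and do not exist in Envoy today:

// Hypothetical sketch only: gate the per-export error log behind a
// configurable flag so operators can opt out of the log spam.
// log_export_failures_ is an assumed config-driven member, not an existing option.
void OpenTelemetryHttpTraceExporter::onSuccess(const Http::AsyncClient::Request& request,
                                               Http::ResponseMessagePtr&& http_response) {
  active_requests_.remove(request);
  const auto response_code = Http::Utility::getResponseStatus(http_response->headers());
  if (response_code != enumToInt(Http::Code::OK)) {
    if (log_export_failures_) {
      ENVOY_LOG(error,
                "OTLP HTTP exporter received a non-success status code: {} while exporting the "
                "OTLP message",
                response_code);
    } else {
      // Keep the detail available at debug level without polluting error-level logs.
      ENVOY_LOG(debug, "OTLP HTTP export failed with status code {}", response_code);
    }
  }
}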

@jbilliau-rcd added the enhancement and triage labels Sep 5, 2024
@phlax
Member

phlax commented Sep 6, 2024

cc @AlexanderEllis @yanavlasov

@jbilliau-rcd
Author

Per our conversation with our Dynatrace support rep:

Our devs have identified some notable behavior. As mentioned before, on the cluster/ActiveGate side we were not seeing the 503s shown by the OTel exporter. This was the main source of confusion, but while 503s were not present, we were reporting large spikes in 429 errors. This throttling error lines up far more with the Envoy-side spam. The reason the 429s appear as 503s seems to be an Envoy/OTel quirk. Searching online has turned up similar cases where, when the export endpoint stops responding, a 503 appears rather than the expected result. Unfortunately there is not much information on why this is, but it is on the Envoy side and not a large focus.

Would you be able to set up the exporter based on the previously shared documentation? This should help provide more robust artifacts showing where the traffic is coming from. In addition, I would pursue the Envoy/OTel side further via a GitHub issue or similar, as there seems to be problematic behavior with the tracing.
https://istio.io/latest/docs/tasks/observability/logs/otel-provider/

Even after removing our ingress object, pod monitors, and readiness probe, thus killing any source of traffic to this test app, we STILL get constant 503s. So what traces is Envoy generating for an app with no ingress/egress traffic? The istio-proxy logs are completely blank except for these nonstop 503s. Are we sure there isn't a problem here?

@joaopgrassi
Contributor

To add, what I observed and was able to reproduce: I configured Envoy to export traces to a "dummy" OTLP back-end that always returned HTTP 429, and Envoy somehow reports that as 503s and seems to constantly retry sending. I sent a single request to Envoy, yet the Envoy logs are filled with these (non-stop):

[screenshot of the Envoy logs]

@joaopgrassi
Contributor

joaopgrassi commented Sep 24, 2024

Actually I think I found what it is :D

I ran a bare-bones Envoy and, while debugging, noticed it was always calling the log function on the exporter: https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/opentelemetry/tracer.cc#L189

But the OTLP request is empty; there's just a resource in it, and it sends those non-stop. I pointed Envoy to webhook.site and immediately got 429s.

[screenshot of the 429 responses from webhook.site]

The problem is that we still export even when there are no spans; it simply exports whenever an exporter is configured:
https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/opentelemetry/tracer.cc#L187

So the fix here should be to only export when there is an exporter AND there are spans to export. @jbilliau-rcd I think this is why you run into the Dynatrace 429s. I also saw that by default the export "timer" is called every 5ms? Not sure what to make of this line: https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/opentelemetry/tracer.cc#L161.
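
Roughly, the fix amounts to a guard like the following in the tracer's flush path (a sketch only; member names such as exporter_ and span_buffer_ are illustrative, not a quote of the actual PR):

// Illustrative sketch of the guard described above, not the merged diff.
// Member names (exporter_, span_buffer_) are assumptions.
void Tracer::flushSpans() {
  // Only build and send an OTLP request when an exporter is configured
  // AND at least one finished span is buffered.
  if (exporter_ != nullptr && !span_buffer_.empty()) {
    ExportTraceServiceRequest request;
    for (const auto& span : span_buffer_) {
      // ... serialize each buffered span into the request ...
    }
    exporter_->log(request);
    span_buffer_.clear();
  }
}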

I will send a PR to address this.

@jbilliau-rcd
Author

@joaopgrassi one last question for you: if we were to proceed with the Istio 1.22 rollout in spite of these errors, would we have issues with reporting spans/traces to Dynatrace? Meaning, if Envoy is constantly trying to export spans, even when empty, it results in Dynatrace 429'ing, right? Would that then cause ACTUAL spans/traces we want from calls to also be 429'd, since DT is already rate limiting us from the garbage empty ones?

@joaopgrassi
Contributor

results in Dynatrace 429'ing, right? Would that then cause ACTUAL spans/traces we want from calls to also be 429'd, since DT is already rate limiting us from the garbage empty ones?

Yes, exactly. The "empty" OTLP messages are still valid, so they are still accepted by Dynatrace and thus consume your tenant's quota. I don't know off the top of my head, but it may be possible to filter those out if you export them to an OTel Collector first and then from the Collector to Dynatrace. If that's not an option, then I'm afraid you would need to wait for a release of the updated Istio.

What we can do is see if we can backport the fix to the Envoy 1.30/1.31 releases and let Istio pick that up. Then you can use the new Istio patch release.

joaopgrassi added a commit to dynatrace-oss-contrib/envoy that referenced this issue Oct 1, 2024
…oxy#36313)

Commit Message: opentelemetrytracer: avoid exporting when there are no spans
Additional Description: Today, the OpenTelemetry tracer exports an OTLP request on each interval, even when there are no spans to be sent. The OTLP request is empty, only containing the resource (and its attributes).
Risk Level: Low
Testing: Manual
Fixes envoyproxy#35997

Signed-off-by: Joao Grassi <5938087+joaopgrassi@users.noreply.github.com>
@joaopgrassi
Contributor

I submitted backports for this issue to versions 1.30 and 1.31: #36405, #36404.

One thing I saw is that Envoy has a way to configure the flush interval (tracing.opentelemetry.flush_interval_ms):
https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/opentelemetry/tracer.cc#L196

So, in theory, you could increase that to, say, 1 minute or more. That can help because it won't export as often, which increases the chance of actually having real spans in each request. I think this might help reduce the number of requests and the Dynatrace throttling. The problem is that I don't think this can be enabled via Istio. Maybe via an EnvoyFilter? But that can be dangerous 🤔

phlax pushed a commit that referenced this issue Oct 1, 2024

Commit Message: opentelemetrytracer: avoid exporting when there are no spans
Additional Description: Today, the OpenTelemetry tracer exports an OTLP request on each interval, even when there are no spans to be sent. The OTLP request is empty, only containing the resource (and its attributes).
Risk Level: Low
Testing: Manual
Fixes #35997

Signed-off-by: Joao Grassi <5938087+joaopgrassi@users.noreply.github.com>
@joaopgrassi
Contributor

@jbilliau-rcd the backports I opened yesterday are already merged, and a new Istio version will probably be released soon to address this.

I understand the frustration, of course. Lots of folks run the Collector in between, and then batching works better. You would be surprised how long organizations take to upgrade software, especially crucial pieces like Istio/Envoy. This bug has been here since the OTel tracer was introduced, which was years ago by now.

To try to make things better here, I investigated more and found a temporary workaround you can use to reduce the number of requests going out from Envoy and hopefully avoid the throttling in Dynatrace.

Envoy has the concept of Runtime configuration, and that can be changed via Istio by deploying a ConfigMap:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-custom-bootstrap-config
data:
  custom_bootstrap.json: |
    "layered_runtime": {
      "layers": [
        {
          "name": "static_layer_0",
          "static_layer": {
            "tracing.opentelemetry.flush_interval_ms": 60000,
            "tracing.opentelemetry.min_flush_spans": 512
          }
        }
      ]
    }
EOF

Then you add an annotation to your pod template, which tells Istio to use this bootstrap config when injecting the Envoy proxies:

template:
  metadata:
    labels:
      app: liveness-http
      version: v1
    annotations:
      sidecar.istio.io/bootstrapOverride: "istio-custom-bootstrap-config" # ADD THIS ANNOTATION

The important thing is that the value of bootstrapOverride is the name of the ConfigMap above.

This changes the OTel exporter to export every minute OR when Envoy has 512 spans waiting to be exported. It should reduce the number of requests Envoy makes, and those requests will then contain real spans, since 1 minute should be enough time for real load to generate spans. You can tweak this to your needs as well, for example increase the flush interval even further, to something like 5 minutes. It highly depends on the load you expect and the number of spans that will be generated.

Once the backports are published, you can stop using this. Let me know if it works for you. If you still run into request throttling with this, please reach out to support and we can try increasing your limits as well.

@joaopgrassi
Contributor

I will open separate issues to allow configuring both flush_interval_ms and min_flush_spans together with the rest of the OTel configuration, instead of only via the Envoy runtime config. I believe this is important enough that it should be easy for users to change.

@jbilliau-rcd
Author

jbilliau-rcd commented Oct 2, 2024


So if this configures the OTel exporter to export every minute OR when it has 512 spans... that would completely eliminate the empty exports, correct? Heck, even if I tell it to wait until it has 2 spans, that's still better than the empty ones.

This is good stuff. Thanks for all the deep diving into this; it's been a huge help and the only thing lifting the block we've had on this migration for two months now.

@joaopgrassi
Contributor

joaopgrassi commented Oct 3, 2024

So if this configures the OTel exporter to export every minute OR when it has 512 spans... that would completely eliminate the empty exports, correct? Heck, even if I tell it to wait until it has 2 spans, that's still better than the empty ones.

Not quite. flush_interval_ms acts like a timer: when the configured time passes, an export happens regardless of the number of spans. If there are 0, then because of the bug an empty request is still sent. The default is 5s (I believe), so changing it to, say, 1 or 5 minutes drastically reduces the number of requests made during that window, but it won't eliminate the empty exports. It will certainly help with the 429/503s you see, though, so it should "fix" the symptom and you won't get the logs anymore.

min_flush_spans then acts as another improvement. The exporter sends a request when either of two things is true:

  1. The timer has elapsed
  2. The number of buffered spans is higher than min_flush_spans

The default for min_flush_spans is 5, which is very low. So, by increasing it, you also reduce the chance of frequent requests. The 512 here comes from the similar default for the Batch Span Processor defined in the OTel specification: https://opentelemetry.io/docs/specs/otel/trace/sdk/#batching-processor
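
In rough pseudo-C++ terms, the interplay of the two settings looks something like this (a sketch under assumed member names, not the exact tracer code):

// Sketch of the two flush triggers described above; names are illustrative.
void Tracer::onSpanFinished(SpanPtr&& span) {
  span_buffer_.push_back(std::move(span));
  // Trigger 2: flush immediately once the buffer reaches min_flush_spans.
  if (span_buffer_.size() >= min_flush_spans_) {
    flushSpans();
  }
}

void Tracer::onFlushTimer() {
  // Trigger 1: the flush_interval_ms timer elapsed; flush whatever is buffered.
  // (With the bug discussed above, this fired even when the buffer was empty.)
  flushSpans();
  flush_timer_->enableTimer(std::chrono::milliseconds(flush_interval_ms_));
}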

@jbilliau-rcd
Author

Both of those settings sound really low... the defaults should be much higher, or better yet, configurable (ideally through Istio, so we aren't writing custom ConfigMaps and annotations all over the place). Not trying to give you another PR to do, just thinking out loud. As per my response to you on the other issue, we are going to go down the route of the OTel Collector first, since deploying annotations and ConfigMaps to thousands of deployments across hundreds of clusters isn't the most feasible option (hence my first sentence).
