
Can we make OTLP exports failing an optional thing to log? Current istio-proxy logs with Dynatrace are super spammy #35997

Closed
jbilliau-rcd opened this issue Sep 5, 2024 · 12 comments · Fixed by #36313
Labels
area/opentelemetry, enhancement

Comments

@jbilliau-rcd

Currently, Envoy hardcodes that a failing OTLP export is logged at error level via ENVOY_LOG, and thus shows up in the proxy log. We are piloting Istio 1.22 with Dynatrace and are constantly seeing this in the logs:

2024-09-05T18:28:10.787073363Z 2024-09-05T18:28:10.786914Z error envoy tracing external/envoy/source/extensions/tracers/opentelemetry/http_trace_exporter.cc:86 OTLP HTTP exporter received a non-success status code: 503 while exporting the OTLP message thread=23

This is being produced by this block of code -

void OpenTelemetryHttpTraceExporter::onSuccess(const Http::AsyncClient::Request& request,
                                               Http::ResponseMessagePtr&& http_response) {
  active_requests_.remove(request);
  const auto response_code = Http::Utility::getResponseStatus(http_response->headers());
  if (response_code != enumToInt(Http::Code::OK)) {
    ENVOY_LOG(error,
              "OTLP HTTP exporter received a non-success status code: {} while exporting the OTLP "
              "message",
              response_code);
  }
}

PROPOSED FIX

Since Istio currently hardcodes a sample rate of 100 when using a custom OTel provider (another issue I need to raise on their GitHub), and Dynatrace rate-limits OTLP exports, parsing Envoy logs becomes very difficult: roughly 70% of the log lines are this 503 message. It would be nice to have a variable or argument to optionally "fail silent", since I don't really care if OTel traces fail when sending to Dynatrace.
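
For illustration, a minimal sketch of what such an opt-out could look like in the exporter's response handler; the log_export_failures_ member and its wiring to configuration are hypothetical and do not exist in Envoy today:

// Hypothetical sketch only: gate the per-export error log behind a
// configurable flag so operators can opt out of the log spam.
// log_export_failures_ is an assumed config-driven member, not an existing option.
void OpenTelemetryHttpTraceExporter::onSuccess(const Http::AsyncClient::Request& request,
                                               Http::ResponseMessagePtr&& http_response) {
  active_requests_.remove(request);
  const auto response_code = Http::Utility::getResponseStatus(http_response->headers());
  if (response_code != enumToInt(Http::Code::OK)) {
    if (log_export_failures_) {
      ENVOY_LOG(error,
                "OTLP HTTP exporter received a non-success status code: {} while exporting the "
                "OTLP message",
                response_code);
    } else {
      // Keep the detail available at debug level without polluting error-level logs.
      ENVOY_LOG(debug, "OTLP HTTP export failed with status code {}", response_code);
    }
  }
}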

@jbilliau-rcd added the enhancement and triage labels Sep 5, 2024
@phlax
Member

phlax commented Sep 6, 2024

cc @AlexanderEllis @yanavlasov

@jbilliau-rcd
Author

Per our conversation with our Dynatrace support rep:

Our devs have identified some notable behavior. As mentioned before, on the cluster/ActiveGate side we were not seeing the 503s shown by the OTel exporter. This was the main source of confusion, but while 503s were not present, we were reporting large spikes in 429 errors. This throttling error lines up far more with the Envoy-side spam. The reason the 429s appear as 503s seems to be an Envoy/OTel quirk. Searching online has turned up similar cases where, when the export endpoint stops responding, a 503 appears rather than the expected result. Unfortunately there is not much information on why this is, but it is on the Envoy side and not a large focus.

Would you be able to set up the exporter based on the previously shared documentation? This should help provide more robust artifacts showing where the traffic is coming from. In addition, I would pursue the Envoy/OTel side further via a GitHub issue or similar, as there seems to be problematic behavior with the tracing.
https://istio.io/latest/docs/tasks/observability/logs/otel-provider/

Even after removing our ingress object, pod monitors, and readiness probe, thus killing any source of traffic to this test app, we STILL get constant 503s. So what traces is Envoy generating for an app with no ingress/egress traffic? The istio-proxy logs are completely blank except for these nonstop 503s. Are we sure there isn't a problem here?

@joaopgrassi
Contributor

To add, what I observed and was able to reproduce: I configured Envoy to export traces to a "dummy" OTLP back-end that always returned HTTP 429, and Envoy somehow reports that as 503s and seems to constantly retry sending. I sent a single request to Envoy, yet the Envoy logs are filled with these (non-stop):

[screenshot of the Envoy logs]

@joaopgrassi
Contributor

joaopgrassi commented Sep 24, 2024

Actually I think I found what it is :D

I ran a bare-bones Envoy and, while debugging, noticed it was always calling the log function on the exporter: https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/opentelemetry/tracer.cc#L189

But the OTLP request is empty; there's just a resource in it, and it sends those non-stop. I pointed Envoy to webhook.site and immediately got 429s.

[screenshot of the 429 responses from webhook.site]

The problem is that we still export even when there are no spans; it simply exports whenever an exporter is configured:
https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/opentelemetry/tracer.cc#L187

So the fix here should be to only export when there is an exporter AND there are spans to export. @jbilliau-rcd I think this is why you run into the Dynatrace 429s. I also saw that by default the export "timer" is called every 5ms? Not sure what to make of this line: https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/opentelemetry/tracer.cc#L161.
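
Roughly, the fix amounts to a guard like the following in the tracer's flush path (a sketch only; member names such as exporter_ and span_buffer_ are illustrative, not a quote of the actual PR):

// Illustrative sketch of the guard described above, not the merged diff.
// Member names (exporter_, span_buffer_) are assumptions.
void Tracer::flushSpans() {
  // Only build and send an OTLP request when an exporter is configured
  // AND at least one finished span is buffered.
  if (exporter_ != nullptr && !span_buffer_.empty()) {
    ExportTraceServiceRequest request;
    for (const auto& span : span_buffer_) {
      // ... serialize each buffered span into the request ...
    }
    exporter_->log(request);
    span_buffer_.clear();
  }
}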

I will send a PR to address this.

@jbilliau-rcd
Author

@joaopgrassi one last question for you: if we were to proceed with the Istio 1.22 rollout in spite of these errors, would we have issues with reporting spans/traces to Dynatrace? Meaning, if Envoy is constantly trying to export spans, even when empty, it results in Dynatrace 429'ing, right? Would that then cause ACTUAL spans/traces we want from calls to also be 429'd, since DT is already rate limiting us from the garbage empty ones?

@joaopgrassi
Contributor

results in Dynatrace 429'ing, right? Would that then cause ACTUAL spans/traces we want from calls to also be 429'd, since DT is already rate limiting us from the garbage empty ones?

Yes, exactly. The "empty" OTLP messages are still valid, so they are still accepted by Dynatrace and thus consume your tenant's quota. I don't know off the top of my head, but it may be possible to filter those out if you export them to an OTel Collector first and then from the Collector to Dynatrace. If that's not an option, then I'm afraid you would need to wait for a release of the updated Istio.

What we can do is see if we can backport the fix to the Envoy 1.30/1.31 releases and let Istio pick that up. Then you can use the new Istio patch release.

joaopgrassi added a commit to dynatrace-oss-contrib/envoy that referenced this issue Oct 1, 2024
…oxy#36313)

Commit Message: opentelemetrytracer: avoid exporting when there are no spans
Additional Description: Today, the OpenTelemetry tracer exports an OTLP request on each interval, even when there are no spans to be sent. The OTLP request is empty, only containing the resource (and its attributes).
Risk Level: Low
Testing: Manual
Fixes envoyproxy#35997

Signed-off-by: Joao Grassi <5938087+joaopgrassi@users.noreply.github.com>
@joaopgrassi
Contributor

I submitted backports for this issue to versions 1.30 and 1.31: #36405, #36404.

One thing I saw is that Envoy has a way to configure the flush interval (tracing.opentelemetry.flush_interval_ms):
https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/opentelemetry/tracer.cc#L196

So, in theory, you could increase that to, say, 1 minute or more. That can help because it won't export as often, which increases the chance of actually having real spans in each request. I think this might help reduce the number of requests and the Dynatrace throttling. The problem is that I don't think this can be enabled via Istio. Maybe via an EnvoyFilter? But that can be dangerous 🤔

phlax pushed a commit that referenced this issue Oct 1, 2024

Commit Message: opentelemetrytracer: avoid exporting when there are no spans
Additional Description: Today, the OpenTelemetry tracer exports an OTLP request on each interval, even when there are no spans to be sent. The OTLP request is empty, only containing the resource (and its attributes).
Risk Level: Low
Testing: Manual
Fixes #35997

Signed-off-by: Joao Grassi <5938087+joaopgrassi@users.noreply.github.com>
@joaopgrassi
Contributor

@jbilliau-rcd the backports I opened yesterday are already merged, and a new Istio version will probably be released soon to address this.

I understand the frustration, of course. Lots of folks run the Collector in between, and then batching works better. You would be surprised how long organizations take to upgrade software, especially crucial pieces like Istio/Envoy. This bug has been here since the OTel tracer was introduced, which was years ago by now.

To try to make things better here, I investigated more and found a temporary workaround you can use to reduce the number of requests going out from Envoy and hopefully avoid the throttling in Dynatrace.

Envoy has the concept of Runtime configuration, and that can be changed via Istio by deploying a ConfigMap:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-custom-bootstrap-config
data:
  custom_bootstrap.json: |
    "layered_runtime": {
      "layers": [
        {
          "name": "static_layer_0",
          "static_layer": {
            "tracing.opentelemetry.flush_interval_ms": 60000,
            "tracing.opentelemetry.min_flush_spans": 512
          }
        }
      ]
    }
EOF

Then you add an annotation to your pod template, which tells Istio to use this bootstrap config when injecting the Envoy proxies:

template:
  metadata:
    labels:
      app: liveness-http
      version: v1
    annotations:
      sidecar.istio.io/bootstrapOverride: "istio-custom-bootstrap-config" # ADD THIS ANNOTATION

The important thing is that the value of bootstrapOverride is the name of the ConfigMap above.

This changes the OTel exporter to export every minute OR when Envoy has 512 spans waiting to be exported. It should reduce the number of requests Envoy makes, and those requests will then contain real spans, since 1 minute should be enough time for real load to generate spans. You can tweak this to your needs as well, for example increase the flush interval even further, to something like 5 minutes. It highly depends on the load you expect and the number of spans that will be generated.

Once the backports are published, you can stop using this. Let me know if it works for you. If you still run into request throttling with this, please reach out to support and we can try increasing your limits as well.

@joaopgrassi
Contributor

I will open separate issues to allow configuring both flush_interval_ms and min_flush_spans together with the rest of the OTel configuration, instead of only via the Envoy runtime config. I believe this is important enough that it should be easy for users to change.

@jbilliau-rcd
Author

jbilliau-rcd commented Oct 2, 2024


So if this configures the OTel exporter to export every minute OR when it has 512 spans... that would completely eliminate the empty exports, correct? Heck, even if I tell it to wait until it has 2 spans, that's still better than the empty ones.

This is good stuff. Thanks for all the deep diving into this; it's been a huge help and the only thing lifting the block we've had on this migration for two months now.

@joaopgrassi
Contributor

joaopgrassi commented Oct 3, 2024

So if this configures the OTel exporter to export every minute OR when it has 512 spans... that would completely eliminate the empty exports, correct? Heck, even if I tell it to wait until it has 2 spans, that's still better than the empty ones.

Not quite. flush_interval_ms acts like a timer: when the configured time passes, an export happens regardless of the number of spans. If there are 0, then because of the bug an empty request is still sent. The default is 5s (I believe), so changing it to, say, 1 or 5 minutes drastically reduces the number of requests made during that window, but it won't eliminate the empty exports. It will certainly help with the 429/503s you see, though, so it should "fix" the symptom and you won't get the logs anymore.

min_flush_spans then acts as another improvement. The exporter sends a request when either of two things is true:

  1. The timer has elapsed
  2. The number of buffered spans is higher than min_flush_spans

The default for min_flush_spans is 5, which is very low. So, by increasing it, you also reduce the chance of frequent requests. The 512 here comes from the similar default for the Batch Span Processor defined in the OTel specification: https://opentelemetry.io/docs/specs/otel/trace/sdk/#batching-processor
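
In rough pseudo-C++ terms, the interplay of the two settings looks something like this (a sketch under assumed member names, not the exact tracer code):

// Sketch of the two flush triggers described above; names are illustrative.
void Tracer::onSpanFinished(SpanPtr&& span) {
  span_buffer_.push_back(std::move(span));
  // Trigger 2: flush immediately once the buffer reaches min_flush_spans.
  if (span_buffer_.size() >= min_flush_spans_) {
    flushSpans();
  }
}

void Tracer::onFlushTimer() {
  // Trigger 1: the flush_interval_ms timer elapsed; flush whatever is buffered.
  // (With the bug discussed above, this fired even when the buffer was empty.)
  flushSpans();
  flush_timer_->enableTimer(std::chrono::milliseconds(flush_interval_ms_));
}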

@jbilliau-rcd
Author

Both of those settings sound really low... the defaults should be much higher, or better yet, configurable (ideally through Istio, so we aren't writing custom ConfigMaps and annotations all over the place). Not trying to give you another PR to do, just thinking out loud. As per my response to you on the other issue, we are going to go down the route of the OTel Collector first, since deploying annotations and ConfigMaps to thousands of deployments across hundreds of clusters isn't the most feasible option (hence my first sentence).
