otel 0.19.0 second try #3421

garypen · 2023-07-12T13:24:19Z

The update requires a change to the implementation and some test updates, as follows:

- In otel 0.18.0, processor factories had a with_memory(bool) method, which we were using when building our prometheus exporter. AFAICT, this was a mechanism for controlling how metrics handled stale gauges. In 0.19.0, the method was removed and all gauges now behave as though they were created with false. We had been passing true on our call. I'm not 100% certain of the impact of this change, but it appears that we can ignore it. We may need to consider it more carefully if problems arise.
- Two standard OTEL attributes, otel_scope_name="apollo/router" and otel_scope_version="", are now added to the output, and a number of tests had to be updated to accommodate that change.
- One of our tests was searching for apollo_router_cache_hit_count (and this was working) when it should have been searching for apollo_router_cache_hit_count_total (likewise for miss). I've updated the test and think this is the correct thing to do. It looks like a bug was fixed in otel and this change matches the fix.
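To illustrate the last two points, here is a minimal sketch of a test helper that looks for a metric by its exact name in Prometheus text output, so that extra labels such as otel_scope_name do not affect the match. The helper name has_metric is hypothetical and is not taken from metrics_tests.rs; the sample line in the usage note is made up apart from the two scope labels quoted above.

```rust
/// Hypothetical helper (not part of the router test suite).
/// Returns true if `output`, in Prometheus text exposition format, contains
/// a sample whose metric name is exactly `name`, whatever labels it carries.
fn has_metric(output: &str, name: &str) -> bool {
    output
        .lines()
        .map(str::trim)
        // Skip blank lines and `# HELP` / `# TYPE` comment lines.
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .any(|line| {
            // The metric name ends at the label block or at the first space.
            let end = line
                .find(|c: char| c == '{' || c.is_whitespace())
                .unwrap_or(line.len());
            &line[..end] == name
        })
}
```

With a sample line such as apollo_router_cache_hit_count_total{kind="example",otel_scope_name="apollo/router",otel_scope_version=""} 2, has_metric finds apollo_router_cache_hit_count_total and does not find the old apollo_router_cache_hit_count, because the comparison is on the whole name rather than a prefix.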

Regarding the counter names in the last item: the prometheus spec mandates a naming format, and the change was part of complying with that spec. This PR made the change: open-telemetry/opentelemetry-rust#952

The two affected counters in the router were:

apollo_router_cache_hit_count -> apollo_router_cache_hit_count_total
apollo_router_cache_miss_count -> apollo_router_cache_miss_count_total
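For anyone with dashboards or alerts built on the old names, the rename can be applied mechanically. The following is a rough sketch only: RENAMED, migrate_query and flush are hypothetical names, not router or otel code, and the query in the usage note is an illustrative PromQL-style string.

```rust
/// The two counters renamed by this upgrade (old names).
const RENAMED: [&str; 2] = [
    "apollo_router_cache_hit_count",
    "apollo_router_cache_miss_count",
];

/// Hypothetical helper: rewrites whole-identifier occurrences of the old
/// counter names to their `_total` form. Names that already carry the
/// suffix are a different identifier and are left untouched.
fn migrate_query(query: &str) -> String {
    let mut out = String::with_capacity(query.len() + 16);
    let mut ident = String::new();
    for c in query.chars() {
        if c.is_ascii_alphanumeric() || c == '_' || c == ':' {
            // Still inside a metric-name-like identifier.
            ident.push(c);
        } else {
            flush(&mut ident, &mut out);
            out.push(c);
        }
    }
    flush(&mut ident, &mut out);
    out
}

/// Appends the pending identifier to `out`, adding `_total` if it is one of
/// the renamed counters, then clears it.
fn flush(ident: &mut String, out: &mut String) {
    if RENAMED.contains(&ident.as_str()) {
        ident.push_str("_total");
    }
    out.push_str(ident.as_str());
    ident.clear();
}
```

For example, migrate_query("rate(apollo_router_cache_hit_count[5m])") yields rate(apollo_router_cache_hit_count_total[5m]), and running it again on that result changes nothing.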

It's good that our prometheus metrics are now spec compliant, but we should note this in the release notes and (if possible) somewhere in our documentation. I'll add it to the changeset at least.

The upgrade fixes many of the outstanding issues related to opentelemetry and various APM vendors:

Fixes: #2878
Fixes: #2066
Fixes: #2959
Fixes: #2225
Fixes: #1520


router-perf · 2023-07-12T13:24:51Z

apollo-router/tests/metrics_tests.rs

BrynCooke

Approved pending the change to metrics_tests.rs

@garypen

> **Note**
>
> When approved, this PR will merge into **the `1.24.0` branch** which will, upon being approved itself, merge into `main`.
>
> **Things to review in this PR**:
> - Changelog correctness (There is a preview below, but it is not necessarily the most up to date. See the _Files Changed_ for the true reality.)
> - Version bumps
> - That it targets the right release branch (`1.24.0` in this case!).

---

## 🚀 Features

### Add support for delta aggregation to otlp metrics ([PR #3412](#3412))

Add a new configuration option (Temporality) to the otlp metrics configuration.

This may be useful to fix problems with metrics when being processed by datadog, which tends to expect Delta, rather than Cumulative, aggregations.

See:
- open-telemetry/opentelemetry-collector-contrib#6129
- DataDog/documentation#15840

for more details.

By [@garypen](https://github.com/garypen) in #3412

## 🐛 Fixes

### Fix error handling for subgraphs ([Issue #3141](#3141))

The GraphQL spec is rather light on what should happen when we process responses from subgraphs. The current behaviour within the Router was inconsistently short circuiting response processing and producing confusing errors.

> #### Processing the response
>
> If the response uses a non-200 status code and the media type of the response payload is application/json then the client MUST NOT rely on the body to be a well-formed GraphQL response since the source of the response may not be the server but instead some intermediary such as API gateways, proxies, firewalls, etc.

The logic has been simplified and made consistent using the following rules:
1. If the content type of the response is not `application/json` or `application/graphql-response+json` then we won't try to parse.
2. If an HTTP status is not 2xx it will always be attached as a graphql error.
3. If the response type is `application/json` and status is not 2xx and the body is not valid graphql the entire subgraph response will be attached as an error.

By [@BrynCooke](https://github.com/BrynCooke) in #3328

## 🛠 Maintenance

### chore: router-bridge 0.3.0+v2.4.8 -> =0.3.1+2.4.9 ([PR #3407](#3407))

Updates `router-bridge` from ` = "0.3.0+v2.4.8"` to ` = "0.3.1+v2.4.9"`. Note that with this PR, this dependency is now pinned to an exact version. This version update started failing tests because of a minor ordering change and it was not immediately clear why the test was failing. Pinning this dependency (that we own) allows us to only bring in the update at the proper time and will make test failures caused by the update easier to identify.

By [@EverlastingBugstopper](https://github.com/EverlastingBugstopper) in #3407

### remove the compiler from Query ([Issue #3373](#3373))

The `Query` object caches information extracted from the query that is used to format responses. It was carrying an `ApolloCompiler` instance, but now we don't really need it anymore, since it is now cached at the query analysis layer. We also should not carry it in the supergraph request and execution request, because that makes the builders hard to manipulate for plugin authors. Since we are not exposing the compiler in the public API yet, we move it inside the context's private entries, where it will be easily accessible from internal code.

By [@Geal](https://github.com/Geal) in #3367

### move AllowOnlyHttpPostMutationsLayer at the supergraph service level ([PR #3374](#3374), [PR #3410](#3410))

Now that we have access to a compiler in supergraph requests, we don't need to look into the query plan to know if a request contains mutations.

By [@Geal](https://github.com/Geal) in #3374 & #3410

### update opentelemetry to 0.19.0 ([Issue #2878](#2878))

We've updated the following opentelemetry related crates:

```
opentelemetry 0.18.0 -> 0.19.0
opentelemetry-datadog 0.6.0 -> 0.7.0
opentelemetry-http 0.7.0 -> 0.8.0
opentelemetry-jaeger 0.17.0 -> 0.18.0
opentelemetry-otlp 0.11.0 -> 0.12.0
opentelemetry-semantic-conventions 0.10.0 -> 0.11.0
opentelemetry-zipkin 0.16.0 -> 0.17.0
opentelemetry-prometheus 0.11.0 -> 0.12.0
tracing-opentelemetry 0.18.0 -> 0.19.0
```

This allows us to close a number of opentelemetry related issues.

Note: The prometheus specification mandates naming format and, unfortunately, the router had two metrics which weren't compliant. The otel upgrade enforces the specification, so the affected metrics are now renamed (see below).

The two affected metrics in the router were:

apollo_router_cache_hit_count -> apollo_router_cache_hit_count_total
apollo_router_cache_miss_count -> apollo_router_cache_miss_count_total

If you are monitoring these metrics via prometheus, please update your dashboards with this name change.

By [@garypen](https://github.com/garypen) in #3421

### Synthesize defer labels without RNG or collisions ([PR #3381](#3381) and [PR #3423](#3423))

The `@defer` directive accepts a `label` argument, but it is optional. To more accurately handle deferred responses, the Router internally rewrites queries to add labels on the `@defer` directive where they are missing. Responses eventually receive the reverse treatment to look as expected by the client.

This was done by generating random strings, handling collision with existing labels, and maintaining a `HashSet` of which labels had been synthesized. Instead, we now add a prefix to pre-existing labels and generate new labels without it. When processing a response, the absence of that prefix indicates a synthetic label.

By [@SimonSapin](https://github.com/SimonSapin) and [@o0Ignition0o](https://github.com/o0Ignition0o) in #3381 and #3423

### Move subscription event execution at the execution service level ([PR #3395](#3395))

In order to prepare some future integration I moved the execution loop for subscription events at the execution_service level.

By [@bnjjj](https://github.com/bnjjj) in #3395

## 📚 Documentation

### Document claim augmentation via coprocessors ([Issue #3102](#3102))

Claims augmentation is a common use case where user information from the JWT claims is used to look up more context like roles from databases, before sending it to subgraphs. This can be done with subgraphs, but it was not documented yet, and there was confusion on the order in which the plugins were called. This clears the confusion and provides an example configuration.

By [@Geal](https://github.com/Geal) in #3386

garypen added 7 commits June 15, 2023 15:22

update otel to 0.19.0

5b38e27

revert the revert

Merge branch 'dev' into garypen/otel-0.19.0

685f6d8

cargo check

47e6448

Merge branch 'dev' into garypen/otel-0.19.0

02d1321

cargo lint

a5164b6

add a changeset

9289226

remove duplicate changeset

197be62

sigh...

garypen requested review from BrynCooke, bnjjj and o0Ignition0o July 12, 2023 13:24

garypen self-assigned this Jul 12, 2023

update PR number

dd1aaf1

BrynCooke reviewed Jul 12, 2023

apollo-router/tests/metrics_tests.rs

add a note about the renamed metrics

2d843c7

BrynCooke approved these changes Jul 12, 2023

bnjjj approved these changes Jul 12, 2023

garypen added 2 commits July 12, 2023 15:10

fix test which doesn't run as part of our test suite

8bdf325

unless you set environment variables...

Merge branch 'dev' into garypen/otel-0.19.0

d12bc81

garypen enabled auto-merge (squash) July 12, 2023 14:13

garypen merged commit d3f37cd into dev Jul 12, 2023

garypen deleted the garypen/otel-0.19.0 branch July 12, 2023 14:32

bnjjj mentioned this pull request Jul 13, 2023

prep release: v1.24.0 #3431

Merged
