Update OpenTelemetry to 0.19.0 #2878

Closed
3 of 6 tasks
BrynCooke opened this issue Mar 29, 2023 · 3 comments · Fixed by #3196 or #3421

@BrynCooke
Contributor

BrynCooke commented Mar 29, 2023

The v0.19.0 packages have been released. They fix a number of issues, many of which were impacting us: https://github.com/open-telemetry/opentelemetry-rust/releases/tag/v0.19.0

Tasks

  1. BrynCooke
  2. component/open-telemetry raised by user
    BrynCooke
  3. bug raised by user
    garypen
@abernix abernix changed the title Update open telemetry to 0.19.0 Update OpenTelemetry to 0.19.0 Mar 29, 2023
@BrynCooke
Contributor Author

Blocked on tokio-rs/tracing-opentelemetry#12

@Geal
Contributor

Geal commented May 26, 2023

@garypen garypen self-assigned this May 31, 2023
garypen added a commit that referenced this issue Jun 5, 2023
The update requires changes to the implementation and to the tests, as
follows:

- In otel 0.18.0, processor factories had a `with_memory(bool)` method
which we were using when building our prometheus exporter. AFAICT, this
used to be a mechanism for controlling how metrics handled stale gauges.
In 0.19.0, [this method was
removed](open-telemetry/opentelemetry-rust#946)
and now gauges are all assumed to be as though they were created with
`false`. We had been providing `true` on our call. I'm not 100% certain
of the impact of this change, but it appears that we can ignore it. We
may need to consider it more carefully if problems arise.
- Two standard OTel attributes,
`otel_scope_name="apollo/router",otel_scope_version=""`, are now added
to the output, and a number of tests had to be updated to accommodate
that change.
- One of our tests appeared to be searching for
`apollo_router_cache_hit_count` (and this was working) when it should
have been searching for `apollo_router_cache_hit_count_total` (likewise
for miss). I've updated the test and think this is the correct thing to
do. It looks like a bug was fixed in otel and this change matches the
fix.

The upgrade fixes many of the outstanding issues related to
opentelemetry and various APM vendors:

Fixes: #2878
Fixes: #2066 
Fixes: #2959 
Fixes: #2225 
Fixes: #1520 

<!-- start metadata -->

**Checklist**

Complete the checklist (and note appropriate exceptions) before a final
PR is raised.

- [x] Changes are compatible[^1]
- [x] Documentation[^2] completed
- [x] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [x] Unit Tests
    - [x] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]. It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]. Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]. Tick whichever testing boxes are applicable. If you are adding
Manual Tests:
- please document the manual testing (extensively) in the Exceptions.
- please raise a separate issue to automate the test and label it (or
ask for it to be labeled) as `manual test`
@o0Ignition0o o0Ignition0o reopened this Jun 15, 2023
@o0Ignition0o
Contributor

o0Ignition0o commented Jun 15, 2023

Reopening the issue since we have to revert the upgrade until they release a patch. See #3242

@garypen garypen mentioned this issue Jul 12, 2023
6 tasks
garypen added a commit that referenced this issue Jul 12, 2023
The update requires changes to the implementation and to the tests, as
follows:

- In otel 0.18.0, processor factories had a `with_memory(bool)` method
which we were using when building our prometheus exporter. AFAICT, this
used to be a mechanism for controlling how metrics handled stale gauges.
In 0.19.0, [this method was
removed](open-telemetry/opentelemetry-rust#946)
and now gauges are all assumed to be as though they were created with
`false`. We had been providing `true` on our call. I'm not 100% certain
of the impact of this change, but it appears that we can ignore it. We
may need to consider it more carefully if problems arise.
- Two standard OTel attributes,
`otel_scope_name="apollo/router",otel_scope_version=""`, are now added
to the output, and a number of tests had to be updated to accommodate
that change.
- One of our tests appeared to be searching for
`apollo_router_cache_hit_count` (and this was working) when it should
have been searching for `apollo_router_cache_hit_count_total` (likewise
for miss). I've updated the test and think this is the correct thing to
do. It looks like a bug was fixed in otel and this change matches the
fix.
 
Regarding that last point: the Prometheus spec mandates a naming format,
and the change was part of complying with that spec. This PR made the
change:
open-telemetry/opentelemetry-rust#952

The two affected counters in the router were:

- `apollo_router_cache_hit_count` -> `apollo_router_cache_hit_count_total`
- `apollo_router_cache_miss_count` -> `apollo_router_cache_miss_count_total`

It's good that our prometheus metrics are now spec compliant, but we
should note this in the release notes and (if possible) somewhere in our
documentation. I'll add it to the changeset at least.

The upgrade fixes many of the outstanding issues related to
opentelemetry and various APM vendors:

Fixes: #2878
Fixes: #2066 
Fixes: #2959 
Fixes: #2225 
Fixes: #1520 

bnjjj added a commit that referenced this issue Jul 13, 2023
> **Note**
>
> When approved, this PR will merge into **the `1.24.0` branch** which
will — upon being approved itself — merge into `main`.
>
> **Things to review in this PR**:
> - Changelog correctness (There is a preview below, but it is not
necessarily the most up to date. See the _Files Changed_ for the true
reality.)
>  - Version bumps
>  - That it targets the right release branch (`1.24.0` in this case!).
>
---
## 🚀 Features

### Add support for delta aggregation to otlp metrics ([PR
#3412](#3412))

Add a new configuration option (Temporality) to the otlp metrics
configuration.

This may help fix problems with metrics that are processed by Datadog,
which tends to expect Delta rather than Cumulative aggregations.

See:
- open-telemetry/opentelemetry-collector-contrib#6129
- DataDog/documentation#15840

for more details.
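
As a rough illustration of the difference between the two temporalities: a cumulative stream reports the running total at every export, while a delta stream reports only the change since the previous export. The sketch below converts one into the other. It is not the router's or the OpenTelemetry SDK's code; the function name `cumulative_to_delta` and the reset handling are assumptions made for the example.

```rust
/// Convert cumulative counter readings into per-interval deltas.
///
/// Hypothetical helper for illustration only.
/// - The first reading is reported as-is (delta from an implicit zero).
/// - A reading lower than its predecessor is assumed to be a counter
///   reset, so the new reading itself is used as the delta.
fn cumulative_to_delta(cumulative: &[u64]) -> Vec<u64> {
    let mut previous = 0u64;
    cumulative
        .iter()
        .map(|&current| {
            let delta = if current >= previous {
                current - previous
            } else {
                current
            };
            previous = current;
            delta
        })
        .collect()
}
```

With cumulative readings of 5, 8, 8 and 12, a backend expecting deltas would want 5, 3, 0 and 4 instead.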

By [@garypen](https://github.com/garypen) in
#3412

## 🐛 Fixes

### Fix error handling for subgraphs ([Issue
#3141](#3141))

The GraphQL spec is rather light on what should happen when we process
responses from subgraphs. The current behaviour within the Router was
inconsistently short-circuiting response processing, and this produced
confusing errors.
> #### Processing the response
> 
> If the response uses a non-200 status code and the media type of the
response payload is application/json then the client MUST NOT rely on
the body to be a well-formed GraphQL response since the source of the
response may not be the server but instead some intermediary such as API
gateways, proxies, firewalls, etc.

The logic has been simplified and made consistent using the following
rules:
1. If the content type of the response is not `application/json` or
`application/graphql-response+json` then we won't try to parse.
2. If an HTTP status is not 2xx it will always be attached as a GraphQL
error.
3. If the response type is `application/json` and status is not 2xx and
the body is not valid GraphQL the entire subgraph response will be
attached as an error.
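
The three rules can be read as a small decision function. The following is a minimal sketch, not the router's implementation: `Outcome`, `classify` and the `body_is_valid_graphql` flag are hypothetical names, and stripping media type parameters such as `; charset=utf-8` before comparing is an assumption.

```rust
/// What to do with a subgraph response (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
struct Outcome {
    /// Rule 1: only attempt to parse the two supported media types.
    parse_body: bool,
    /// Rule 2: any non-2xx status is attached as a GraphQL error.
    attach_status_error: bool,
    /// Rule 3: attach the entire response as an error.
    attach_raw_body_error: bool,
}

/// `body_is_valid_graphql` stands in for the result of a real parse attempt.
fn classify(content_type: &str, status: u16, body_is_valid_graphql: bool) -> Outcome {
    // Assumption: ignore parameters and case when comparing media types.
    let media_type = content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    let is_json = media_type == "application/json";
    let is_graphql_json = media_type == "application/graphql-response+json";
    let is_success = (200..300).contains(&status);

    Outcome {
        parse_body: is_json || is_graphql_json,
        attach_status_error: !is_success,
        attach_raw_body_error: is_json && !is_success && !body_is_valid_graphql,
    }
}
```

For example, a `502` HTML page from a proxy is never parsed and yields only the status error, while a `500` with an `application/json` body that is not valid GraphQL yields both the status error and the raw-body error.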

By [@BrynCooke](https://github.com/BrynCooke) in
#3328

## 🛠 Maintenance

### chore: router-bridge 0.3.0+v2.4.8 -> =0.3.1+2.4.9 ([PR
#3407](#3407))

Updates `router-bridge` from ` = "0.3.0+v2.4.8"` to ` = "0.3.1+v2.4.9"`.
Note that with this PR, this dependency is now pinned to an exact
version. This version update started failing tests because of a minor
ordering change and it was not immediately clear why the test was
failing. Pinning this dependency (that we own) allows us to only bring
in the update at the proper time and will make test failures caused by
the update easier to identify.

By [@EverlastingBugstopper](https://github.com/EverlastingBugstopper) in
#3407

### remove the compiler from Query ([Issue
#3373](#3373))

The `Query` object caches information extracted from the query that is
used to format responses. It was carrying an `ApolloCompiler` instance,
but now we don't really need it anymore, since it is now cached at the
query analysis layer. We also should not carry it in the supergraph
request and execution request, because that makes the builders hard to
manipulate for plugin authors. Since we are not exposing the compiler in
the public API yet, we move it inside the context's private entries,
where it will be easily accessible from internal code.

By [@Geal](https://github.com/Geal) in
#3367

### move AllowOnlyHttpPostMutationsLayer at the supergraph service level
([PR #3374](#3374), [PR
#3410](#3410))

Now that we have access to a compiler in supergraph requests, we don't
need to look into the query plan to know if a request contains mutations.

By [@Geal](https://github.com/Geal) in
#3374 &
#3410

### update opentelemetry to 0.19.0 ([Issue
#2878](#2878))


We've updated the following opentelemetry related crates:

```
opentelemetry 0.18.0 -> 0.19.0
opentelemetry-datadog 0.6.0 -> 0.7.0
opentelemetry-http 0.7.0 -> 0.8.0
opentelemetry-jaeger 0.17.0 -> 0.18.0
opentelemetry-otlp 0.11.0 -> 0.12.0
opentelemetry-semantic-conventions 0.10.0 -> 0.11.0
opentelemetry-zipkin 0.16.0 -> 0.17.0
opentelemetry-prometheus 0.11.0 -> 0.12.0
tracing-opentelemetry 0.18.0 -> 0.19.0
```

This allows us to close a number of opentelemetry related issues.

Note:

The Prometheus specification mandates a naming format and, unfortunately,
the router had two metrics which weren't compliant. The otel upgrade
enforces the specification, so the affected metrics are now renamed (see
below).

The two affected metrics in the router were:

- `apollo_router_cache_hit_count` -> `apollo_router_cache_hit_count_total`
- `apollo_router_cache_miss_count` -> `apollo_router_cache_miss_count_total`

If you are monitoring these metrics via prometheus, please update your
dashboards with this name change.
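
The rename follows the convention that counter names end in `_total`. Here is a minimal sketch of that rule, useful for example when migrating dashboard queries in bulk. It is not the exporter's code, and the function name `prometheus_counter_name` is hypothetical.

```rust
/// Return the counter name with the `_total` suffix that the Prometheus
/// naming convention expects, leaving already-compliant names untouched.
///
/// Hypothetical helper for illustration only.
fn prometheus_counter_name(name: &str) -> String {
    if name.ends_with("_total") {
        name.to_string()
    } else {
        format!("{}_total", name)
    }
}
```

Applied to the two old router metric names, this produces the new names listed above.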

By [@garypen](https://github.com/garypen) in
#3421

### Synthesize defer labels without RNG or collisions ([PR
#3381](#3381) and [PR
#3423](#3423))

The `@defer` directive accepts a `label` argument, but it is optional.
To more accurately handle deferred responses, the Router internally
rewrites queries to add labels on the `@defer` directive where they are
missing. Responses eventually receive the reverse treatment to look as
expected by the client.

This was done by generating random strings, handling collisions with
existing labels, and maintaining a `HashSet` of which labels had been
synthesized. Instead, we now add a prefix to pre-existing labels and
generate new labels without it. When processing a response, the absence
of that prefix indicates a synthetic label.
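
A minimal sketch of the prefix scheme, under stated assumptions: the prefix value `_`, the use of a counter for synthetic labels, and the names `rewrite_labels` and `restore_label` are all hypothetical and not taken from the router's code. Because every client-supplied label gains the prefix and synthetic labels are plain digits, the two sets cannot collide.

```rust
/// Hypothetical marker prepended to labels that the client supplied.
const USER_LABEL_PREFIX: &str = "_";

/// Rewrite the labels of a query's `@defer` directives, in order.
/// `None` means the directive had no `label` argument.
fn rewrite_labels(labels: &[Option<&str>]) -> Vec<String> {
    let mut next_id = 0usize;
    labels
        .iter()
        .map(|label| match label {
            // Pre-existing label: keep it, but mark it with the prefix.
            Some(user) => format!("{}{}", USER_LABEL_PREFIX, user),
            // Missing label: synthesize one that never starts with the prefix.
            None => {
                let synthetic = next_id.to_string();
                next_id += 1;
                synthetic
            }
        })
        .collect()
}

/// Reverse the rewrite when processing a response.
/// Returns the client's original label, or `None` for a synthetic label.
fn restore_label(label: &str) -> Option<&str> {
    label.strip_prefix(USER_LABEL_PREFIX)
}
```

Note that a client label of `0` becomes `_0` and so stays distinct from the first synthetic label `0`, without any random generation or set of synthesized labels to maintain.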

By [@SimonSapin](https://github.com/SimonSapin) and
[@o0Ignition0o](https://github.com/o0Ignition0o) in
#3381 and
#3423

### Move subscription event execution at the execution service level
([PR #3395](#3395))

To prepare for a future integration, I moved the execution loop
for subscription events to the `execution_service` level.

By [@bnjjj](https://github.com/bnjjj) in
#3395

## 📚 Documentation

### Document claim augmentation via coprocessors ([Issue
#3102](#3102))

Claims augmentation is a common use case where user information from the
JWT claims is used to look up more context like roles from databases,
before sending it to subgraphs. This can be done with coprocessors, but it
was not documented yet, and there was confusion on the order in which
the plugins were called. This clears the confusion and provides an
example configuration.

By [@Geal](https://github.com/Geal) in
#3386