Fix error handling for subgraphs #3328

BrynCooke · 2023-06-27T15:25:21Z

The graphql spec is lax about what strategy to use for processing responses: https://github.com/graphql/graphql-over-http/blob/main/spec/GraphQLOverHTTP.md#processing-the-response

> If the response uses a non-200 status code and the media type of the response payload is application/json
> then the client MUST NOT rely on the body to be a well-formed GraphQL response since the source of the response
> may not be the server but instead some intermediary such as API gateways, proxies, firewalls, etc.

The TLDR of this is that it's really asking us to do the best we can with whatever information we have with some modifications depending on content type.
Our goal is to give the user the most relevant information possible in the response errors.

Rules:

1. If the content type of the response is not `application/json` or `application/graphql-response+json` then we won't try to parse.
2. If an HTTP status is not 2xx it will always be attached as a graphql error.
3. If the response type is `application/json` and status is not 2xx and the body is not valid graphql then parse errors will be suppressed.
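The three rules can be read as a small decision function. The following is only a sketch under assumptions: `ResponseKind`, `Plan` and `plan_for` are hypothetical names that do not appear in the PR, and rule 3 is encoded as worded here (suppression), which is the part still under discussion.

```rust
/// Hypothetical classification of the subgraph response's content type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResponseKind {
    ApplicationJson,
    ApplicationGraphqlResponseJson,
    Other,
}

/// Hypothetical summary of what to do with a subgraph response.
#[derive(Debug, PartialEq, Eq)]
pub struct Plan {
    /// Rule 1: only try to parse the body for the two GraphQL content types.
    pub try_parse: bool,
    /// Rule 2: a non-2xx status is always reported as a GraphQL error.
    pub attach_status_error: bool,
    /// Rule 3: for non-2xx `application/json`, body parse failures are not reported.
    pub suppress_parse_errors: bool,
}

/// Apply the three rules to a status code and a classified content type.
pub fn plan_for(status: u16, kind: ResponseKind) -> Plan {
    let is_success = (200..300).contains(&status);
    Plan {
        try_parse: kind != ResponseKind::Other,
        attach_status_error: !is_success,
        suppress_parse_errors: !is_success && kind == ResponseKind::ApplicationJson,
    }
}
```

Keeping the decision separate from the HTTP plumbing makes each rule easy to check in isolation, for example that a 502 with `application/json` both attaches a status error and suppresses parse errors.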

Rule #3 is definitely up for debate:

> If the response type is `application/json` and status is not 2xx and the body is not valid graphql then parse errors will be suppressed.

Alternatives are that:

- an error is attached with the entire contents of the response.
- the response is logged as a WARN.

Fixes #3141

Checklist

Complete the checklist (and note appropriate exceptions) before a final PR is raised.

Exceptions

Note any exceptions here

Notes

[^1]. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
[^2]. Configuration is an important part of many changes. Where applicable please try to document configuration examples.
[^3]. Tick whichever testing boxes are applicable. If you are adding Manual Tests:
- please document the manual testing (extensively) in the Exceptions.
- please raise a separate issue to automate the test and label it (or ask for it to be labeled) as manual test

router-perf · 2023-06-27T15:25:52Z

BrynCooke · 2023-06-28T11:07:17Z

Note to reviewers: I think the unit tests need refactoring to make them DRY, but I haven't got the time to do this right now.
I have just used the existing pattern.

.changesets/fix_taffy_crank_flier_seaweed.md

apollo-router/src/services/subgraph_service.rs


o0Ignition0o · 2023-07-05T15:03:58Z

As discussed this afternoon:

Subgraph errors (no data) when dealing with @requires:

- if the subgraph returns an error, we should not add more messages
- if the subgraph doesn't return any error, we should add a message.

Currently we return
"Subgraph response from '{}' was missing key _entities"
we should make it "subgraph did not return a valid graphql response" (and possibly mention the need for entities or either data or errors)
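As a rough illustration of the behaviour proposed above, a message is only added when the subgraph reported no errors of its own. The helper name `extra_error_message` and the exact wording of the message are assumptions for this sketch, not the router's actual code.

```rust
/// Hypothetical helper: returns an extra error message only when the subgraph
/// response carried no errors of its own.
pub fn extra_error_message(service: &str, subgraph_errors: &[String]) -> Option<String> {
    if subgraph_errors.is_empty() {
        // The subgraph gave us nothing to explain the missing data, so we say so.
        Some(format!(
            "subgraph '{service}' did not return a valid graphql response (expected _entities, or either data or errors)"
        ))
    } else {
        // The subgraph already explained the failure; do not add noise.
        None
    }
}
```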

glasser · 2023-07-06T19:43:00Z

Would review from me still be helpful? Have been out since around when this was filed.

I'm not clear on what this means in the description:

> If an HTTP status is not 2xx it will always be attached as a graphql error.

Is this only for JSON responses? What gets attached as the error?

BrynCooke · 2023-07-07T09:30:51Z

@glasser I'm going to work on this today and then I think it's worth reviewing from scratch once it's ready (I'm going to take this back to draft). I'll add some extra comments to clarify things.

BrynCooke · 2023-07-07T09:33:36Z

> Would review from me still be helpful? Have been out since around when this was filed.
>
> I'm not clear on what this means in the description:
>
> > If an HTTP status is not 2xx it will always be attached as a graphql error.
>
> Is this only for JSON responses? What gets attached as the error?

It'll be for all responses. Basically there will never be a situation where the HTTP status is not a success and there are no errors in the graphql response.

Spec says: If the data entry in the response is not present, the errors entry in the response must not be empty. It must contain at least one error. The errors it contains should indicate why no data was able to be returned.
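The invariant described in this reply can be sketched as a post-processing step. Everything here is illustrative: the `Response` struct is a simplified stand-in (strings instead of JSON values and structured errors), and `ensure_errors` and its messages are hypothetical.

```rust
/// Simplified stand-in for a GraphQL response (assumption: real code uses
/// JSON values and structured errors rather than strings).
#[derive(Debug, Default, PartialEq)]
pub struct Response {
    pub data: Option<String>,
    pub errors: Vec<String>,
}

/// Hypothetical post-processing step enforcing the invariant discussed above.
pub fn ensure_errors(status: u16, mut response: Response) -> Response {
    // A non-2xx status always contributes an error, whatever the body held.
    if !(200..300).contains(&status) {
        response
            .errors
            .insert(0, format!("HTTP fetch failed: status {status}"));
    }
    // Per the spec quote: no `data` entry means `errors` must not be empty.
    if response.data.is_none() && response.errors.is_empty() {
        response
            .errors
            .push("subgraph returned neither data nor errors".to_string());
    }
    response
}
```

With this shape, a failed HTTP status can never produce a response with an empty error list, and a successful status with no data still yields at least one error.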

apollo-router/src/response.rs

apollo-router/src/services/subgraph_service.rs

Co-authored-by: Jeremy Lempereur <jeremy.lempereur@iomentum.com>

Geal · 2023-07-11T08:51:39Z

refactoring the subgraph service was a good idea, but here it makes the review really hard to do when we look for the actual error handling changes :/

apollo-router/src/services/subgraph_service.rs

Co-authored-by: Coenen Benjamin <benjamin.coenen@hotmail.com>

o0Ignition0o

looking great!

apollo-router/src/services/subgraph_service.rs

Geal · 2023-07-11T13:27:35Z

apollo-router/src/services/subgraph_service.rs

+        Some(Ok(Ok(content_type))) if (content_type.ty == APPLICATION && content_type.subty == JSON) => Ok(ContentType::ApplicationJson),
+        Some(Ok(Ok(content_type))) if (content_type.ty == APPLICATION && content_type.subty == GRAPHQL_RESPONSE && content_type.suffix == Some(JSON)) => Ok(ContentType::ApplicationGraphqlResponseJson),
+        Some(Ok(Ok(content_type))) => {


nit: it may be more readable to match once on Some(Ok(Ok(content_type))) then have a nested match or branches on content_type fields

This fixes cargo fmt so worth doing: 16de223
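The nit above suggests matching once on the outer value and then branching on the content type fields. A minimal sketch of that shape follows, assuming a simplified `MediaType` struct and plain string literals in place of the parsed header type and the `APPLICATION`, `JSON` and `GRAPHQL_RESPONSE` constants from the diff; the `Some(Ok(Ok(..)))` nesting is flattened to an `Option`, and the error strings are invented for illustration.

```rust
/// Simplified stand-in for a parsed media type (hypothetical).
pub struct MediaType<'a> {
    pub ty: &'a str,
    pub subty: &'a str,
    pub suffix: Option<&'a str>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ContentType {
    ApplicationJson,
    ApplicationGraphqlResponseJson,
}

/// Match once on the outer `Option`, then once on the media type fields.
pub fn get_graphql_content_type(parsed: Option<MediaType<'_>>) -> Result<ContentType, String> {
    match parsed {
        Some(media) => match (media.ty, media.subty, media.suffix) {
            // `application/json`, whatever the suffix.
            ("application", "json", _) => Ok(ContentType::ApplicationJson),
            // `application/graphql-response+json`.
            ("application", "graphql-response", Some("json")) => {
                Ok(ContentType::ApplicationGraphqlResponseJson)
            }
            (ty, subty, _) => Err(format!("unsupported content type {ty}/{subty}")),
        },
        None => Err("missing content-type header".to_string()),
    }
}
```

Matching on a tuple of fields keeps each arm short, which also tends to format more predictably than long `if` guards on repeated outer patterns.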

apollo-router/src/services/subgraph_service.rs

The graphql spec is lax about what strategy to use for processing responses: https://github.com/graphql/graphql-over-http/blob/main/spec/GraphQLOverHTTP.md#processing-the-response

> If the response uses a non-200 status code and the media type of the response payload is application/json then the client MUST NOT rely on the body to be a well-formed GraphQL response since the source of the response may not be the server but instead some intermediary such as API gateways, proxies, firewalls, etc.

The TLDR of this is that it's really asking us to do the best we can with whatever information we have with some modifications depending on content type. Our goal is to give the user the most relevant information possible in the response errors.

Rules:

1. If the content type of the response is not `application/json` or `application/graphql-response+json` then we won't try to parse.
2. If an HTTP status is not 2xx it will always be attached as a graphql error.
3. If the response type is `application/json` and status is not 2xx and the body is not valid graphql then the entire body of the response will be added as an error.

Fixes #3141

**Checklist**

Complete the checklist (and note appropriate exceptions) before a final PR is raised.

- [x] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [x] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
[^2]. Configuration is an important part of many changes. Where applicable please try to document configuration examples.
[^3]. Tick whichever testing boxes are applicable. If you are adding Manual Tests:
    - please document the manual testing (extensively) in the Exceptions.
    - please raise a separate issue to automate the test and label it (or ask for it to be labeled) as `manual test`

---------

Co-authored-by: bryn <bryn@apollographql.com>
Co-authored-by: Jeremy Lempereur <jeremy.lempereur@iomentum.com>
Co-authored-by: Coenen Benjamin <benjamin.coenen@hotmail.com>

@garypen

> **Note**
>
> When approved, this PR will merge into **the `1.24.0` branch** which will, upon being approved itself, merge into `main`.
>
> **Things to review in this PR**:
> - Changelog correctness (There is a preview below, but it is not necessarily the most up to date. See the _Files Changed_ for the true reality.)
> - Version bumps
> - That it targets the right release branch (`1.24.0` in this case!).

---

## 🚀 Features

### Add support for delta aggregation to otlp metrics ([PR #3412](#3412))

Add a new configuration option (Temporality) to the otlp metrics configuration.

This may be useful to fix problems with metrics when being processed by datadog which tends to expect Delta, rather than Cumulative, aggregations.

See:
- open-telemetry/opentelemetry-collector-contrib#6129
- DataDog/documentation#15840

for more details.

By [@garypen](https://github.com/garypen) in #3412

## 🐛 Fixes

### Fix error handling for subgraphs ([Issue #3141](#3141))

The GraphQL spec is rather light on what should happen when we process responses from subgraphs. The current behaviour within the Router was inconsistently short circuiting response processing and thus producing confusing errors.

> #### Processing the response
>
> If the response uses a non-200 status code and the media type of the response payload is application/json then the client MUST NOT rely on the body to be a well-formed GraphQL response since the source of the response may not be the server but instead some intermediary such as API gateways, proxies, firewalls, etc.

The logic has been simplified and made consistent using the following rules:

1. If the content type of the response is not `application/json` or `application/graphql-response+json` then we won't try to parse.
2. If an HTTP status is not 2xx it will always be attached as a graphql error.
3. If the response type is `application/json` and status is not 2xx and the body is not valid graphql the entire subgraph response will be attached as an error.

By [@BrynCooke](https://github.com/BrynCooke) in #3328

## 🛠 Maintenance

### chore: router-bridge 0.3.0+v2.4.8 -> =0.3.1+2.4.9 ([PR #3407](#3407))

Updates `router-bridge` from ` = "0.3.0+v2.4.8"` to ` = "0.3.1+v2.4.9"`, note that with this PR, this dependency is now pinned to an exact version. This version update started failing tests because of a minor ordering change and it was not immediately clear why the test was failing. Pinning this dependency (that we own) allows us to only bring in the update at the proper time and will make test failures caused by the update easier to identify.

By [@EverlastingBugstopper](https://github.com/EverlastingBugstopper) in #3407

### remove the compiler from Query ([Issue #3373](#3373))

The `Query` object caches information extracted from the query that is used to format responses. It was carrying an `ApolloCompiler` instance, but now we don't really need it anymore, since it is now cached at the query analysis layer. We also should not carry it in the supergraph request and execution request, because that makes the builders hard to manipulate for plugin authors. Since we are not exposing the compiler in the public API yet, we move it inside the context's private entries, where it will be easily accessible from internal code.

By [@Geal](https://github.com/Geal) in #3367

### move AllowOnlyHttpPostMutationsLayer at the supergraph service level ([PR #3374](#3374), [PR #3410](#3410))

Now that we have access to a compiler in supergraph requests, we don't need to look into the query plan to know if a request contains mutations.

By [@Geal](https://github.com/Geal) in #3374 & #3410

### update opentelemetry to 0.19.0 ([Issue #2878](#2878))

We've updated the following opentelemetry related crates:

```
opentelemetry 0.18.0 -> 0.19.0
opentelemetry-datadog 0.6.0 -> 0.7.0
opentelemetry-http 0.7.0 -> 0.8.0
opentelemetry-jaeger 0.17.0 -> 0.18.0
opentelemetry-otlp 0.11.0 -> 0.12.0
opentelemetry-semantic-conventions 0.10.0 -> 0.11.0
opentelemetry-zipkin 0.16.0 -> 0.17.0
opentelemetry-prometheus 0.11.0 -> 0.12.0
tracing-opentelemetry 0.18.0 -> 0.19.0
```

This allows us to close a number of opentelemetry related issues.

Note: The prometheus specification mandates naming format and, unfortunately, the router had two metrics which weren't compliant. The otel upgrade enforces the specification, so the affected metrics are now renamed (see below).

The two affected metrics in the router were:

- `apollo_router_cache_hit_count` -> `apollo_router_cache_hit_count_total`
- `apollo_router_cache_miss_count` -> `apollo_router_cache_miss_count_total`

If you are monitoring these metrics via prometheus, please update your dashboards with this name change.

By [@garypen](https://github.com/garypen) in #3421

### Synthesize defer labels without RNG or collisions ([PR #3381](#3381) and [PR #3423](#3423))

The `@defer` directive accepts a `label` argument, but it is optional. To more accurately handle deferred responses, the Router internally rewrites queries to add labels on the `@defer` directive where they are missing. Responses eventually receive the reverse treatment to look as expected by the client.

This was done by generating random strings, handling collision with existing labels, and maintaining a `HashSet` of which labels had been synthesized. Instead, we now add a prefix to pre-existing labels and generate new labels without it. When processing a response, the absence of that prefix indicates a synthetic label.

By [@SimonSapin](https://github.com/SimonSapin) and [@o0Ignition0o](https://github.com/o0Ignition0o) in #3381 and #3423

### Move subscription event execution at the execution service level ([PR #3395](#3395))

In order to prepare some future integration I moved the execution loop for subscription events at the execution_service level.

By [@bnjjj](https://github.com/bnjjj) in #3395

## 📚 Documentation

### Document claim augmentation via coprocessors ([Issue #3102](#3102))

Claims augmentation is a common use case where user information from the JWT claims is used to look up more context like roles from databases, before sending it to subgraphs. This can be done with subgraphs, but it was not documented yet, and there was confusion on the order in which the plugins were called. This clears the confusion and provides an example configuration.

By [@Geal](https://github.com/Geal) in #3386

apollo-bot2 assigned BrynCooke Jun 27, 2023


BrynCooke force-pushed the bryn/http-error-handling branch from eee38bf to 94978d8 Compare June 28, 2023 11:05

BrynCooke force-pushed the bryn/http-error-handling branch from 94978d8 to f7394fa Compare June 28, 2023 11:48

BrynCooke marked this pull request as ready for review June 28, 2023 12:11

BrynCooke requested review from Geal, o0Ignition0o and glasser June 28, 2023 12:12

lennyburdette reviewed Jun 29, 2023

.changesets/fix_taffy_crank_flier_seaweed.md (outdated, resolved)

BrynCooke mentioned this pull request Jun 29, 2023

Router treats subgraph 4xx responses as errors #2687

Closed

bnjjj reviewed Jun 30, 2023

apollo-router/src/services/subgraph_service.rs (outdated, resolved)

apollo-router/src/services/subgraph_service.rs (outdated, resolved)

Geal requested changes Jun 30, 2023


bryn added 3 commits July 4, 2023 15:04

Take in feedback, use map_err instead of match.

7534cc6

Take in feedback, Make sure leave active request is called.

397a5b4

BrynCooke force-pushed the bryn/http-error-handling branch from e22cdda to 397a5b4 Compare July 4, 2023 14:20

Temp

dd1cb39

Merge and take in feedback

a3b177a

bryn added 5 commits July 7, 2023 10:56

Clippy and test fixes

c51ea4b

Update comment

d5c4581

Add test for no data in response.

7376024

Spec says: If the data entry in the response is not present, the errors entry in the response must not be empty. It must contain at least one error. The errors it contains should indicate why no data was able to be returned.

Merge branch 'dev' into bryn/http-error-handling

6c5b1f5

Fix tests

994e647

BrynCooke requested a review from Geal July 7, 2023 16:12

bryn and others added 2 commits July 10, 2023 13:36

Fix changelog

b257f83

Merge branch 'dev' into bryn/http-error-handling

0c35207

o0Ignition0o suggested changes Jul 11, 2023

apollo-router/src/response.rs (outdated, resolved)

apollo-router/src/response.rs (outdated, resolved)

o0Ignition0o reviewed Jul 11, 2023

apollo-router/src/services/subgraph_service.rs (outdated, resolved)

BrynCooke and others added 2 commits July 11, 2023 09:42

Update apollo-router/src/response.rs

552c72c

Co-authored-by: Jeremy Lempereur <jeremy.lempereur@iomentum.com>

Use return rather than ?

709bb2e

bnjjj reviewed Jul 11, 2023

apollo-router/src/services/subgraph_service.rs (outdated, resolved)

apollo-router/src/services/subgraph_service.rs (outdated, resolved)

apollo-router/src/services/subgraph_service.rs (outdated, resolved)

bryn and others added 3 commits July 11, 2023 10:07

Revert unintended changes due to merge error.

ab60f8c

Update apollo-router/src/services/subgraph_service.rs

d3a0948

Co-authored-by: Coenen Benjamin <benjamin.coenen@hotmail.com>

Update apollo-router/src/services/subgraph_service.rs

bfc17f2

Co-authored-by: Coenen Benjamin <benjamin.coenen@hotmail.com>

BrynCooke requested a review from o0Ignition0o July 11, 2023 09:19

bryn added 3 commits July 11, 2023 10:30

Use null instead of empty json

d35b119

Formatting

8ed9b1c

Formatting

45a8bde

o0Ignition0o approved these changes Jul 11, 2023

apollo-router/src/services/subgraph_service.rs (outdated, resolved)

apollo-router/src/services/subgraph_service.rs (outdated, resolved)

bryn added 2 commits July 11, 2023 13:46

Fix comment

77ccb05

Rename content_type -> get_graphql_content_type

61a9a96

Geal approved these changes Jul 11, 2023


bryn added 4 commits July 11, 2023 14:33

Add back instrumentation

3e5ce2e

Move to random listen address for all tests.

6aec2e0

Tweak code in get_graphql_content_type for readability

16de223

Clippy

d9e0573

BrynCooke enabled auto-merge (squash) July 11, 2023 14:00

BrynCooke merged commit 33f35e9 into dev Jul 11, 2023

BrynCooke deleted the bryn/http-error-handling branch July 11, 2023 14:17

bnjjj mentioned this pull request Jul 13, 2023

prep release: v1.24.0 #3431

Merged

bnjjj mentioned this pull request Jul 18, 2023

Implement callback subscription protocol in new plugin ApolloServerPluginSubscriptionCallback apollographql/apollo-server#7617

Merged
