add a warning if we think istio-proxy injection is causing problems #3545

garypen · 2023-08-07T11:01:25Z

We have encountered situations where the injection of istio-proxy in a router pod (executing in kubernetes) causes strange networking errors during uplink retrieval.

The root cause of the issue is that the router is executing and attempting retrieve uplink schemas whilst the istio-proxy is modifying network configuration at the same time.

This warning message will direct users to information which should help them to configure their cluster or pod to avoid this problem.

fixes: #3533

Checklist

Complete the checklist (and note appropriate exceptions) before a final PR is raised.

Changes are compatible[^1]
Documentation[^2] completed
Performance impact assessed and acceptable
Tests added and passing[^3]
~~- [ ] Unit Tests~~
~~- [ ] Integration Tests~~
- Manual Tests

Exceptions

This is difficult to test, since the root cause (istio networking re-configuration as the pod executes) is very difficult to reproduce in a test environment.

Manual testing was performed by:

triggering a nightly build
deploying the resulting image to a test cluster
using the istio pod annotation: proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": false }' to ensure the error occurred
observing the desired warning message is produced

Here's some sample output from the test:

{"timestamp":"2023-08-07T13:04:15.762898Z","level":"WARN","message":"If your router is executing within a kubernetes pod, this failure may be caused by istio-proxy injection. See https://github.com/apollographql/router/issues/3533 for more details about how to solve this","target":"apollo_router::uplink"}
{"timestamp":"2023-08-07T13:04:15.782715Z","level":"ERROR","message":"fetch failed from all endpoints","target":"apollo_router::router::event::schema"}

Notes

[^1]. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
[^2]. Configuration is an important part of many changes. Where applicable please try to document configuration examples.
[^3]. Tick whichever testing boxes are applicable. If you are adding Manual Tests:
- please document the manual testing (extensively) in the Exceptions.
- please raise a separate issue to automate the test and label it (or ask for it to be labeled) as manual test

We have encountered situations where the injection of istio-proxy in a router pod (executing in kubernetes) causes strange networking errors during uplink retrieval. The root cause of the issue is that the router is executing and attempting retrieve uplink schemas whilst the istio-proxy is modifying network configuration at the same time. This warning message will direct users to information which should help them to configure their cluster or pod to avoid this problem. fixes: #3533

router-perf · 2023-08-07T11:01:56Z

and a changeset

Meschreiber

Teeny copy edit suggestions. Thanks for documenting this!

.changesets/maint_garypen_3533_istio_warn.md

docs/source/containerization/kubernetes.mdx

Co-authored-by: Maria Elisabeth Schreiber <maria.schreiber@apollographql.com>

@Geal

> **Note** > > When approved, this PR will merge into **the `1.27.0` branch** which will — upon being approved itself — merge into `main`. > > **Things to review in this PR**: > - Changelog correctness (There is a preview below, but it is not necessarily the most up to date. See the _Files Changed_ for the true reality.) > - Version bumps > - That it targets the right release branch (`1.27.0` in this case!). > --- ## 🚀 Features ### Add a metric tracking coprocessor latency ([Issue #2924](#2924)) Introduces a new metric for the router: ``` apollo.router.operations.coprocessor.duration ``` It has one attribute: ``` coprocessor.stage: string (RouterRequest, RouterResponse, SubgraphRequest, SubgraphResponse) ``` It is a histogram metric tracking the time spent calling into the coprocessor. Note that the name of this metric may change in the future. By [@Geal](https://github.com/Geal) in #3513 ### Configure AWS SigV4 authentication for subgraph requests ([PR #3365](#3365)) Secure your router to subgraph communication on AWS using [Signature Version 4](https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html) (Sigv4)! This changeset provides you with a way to set up hard-coded credentials, as well as a default provider chain. We recommend using the default provider chain configuration. Full use example: ```yaml authentication: subgraph: all: # configuration that will apply to all subgraphs aws_sig_v4: default_chain: profile_name: "my-test-profile" # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#ec2-instance-profile region: "us-east-1" # https://docs.aws.amazon.com/general/latest/gr/rande.html service_name: "lambda" # https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_aws-services-that-work-with-iam.html assume_role: # https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html role_arn: "test-arn" session_name: "test-session" external_id: "test-id" subgraphs: products: aws_sig_v4: hardcoded: # Not recommended, prefer using default_chain as shown above access_key_id: "my-access-key" secret_access_key: "my-secret-access-key" region: "us-east-1" service_name: "vpc-lattice-svcs" # "s3", "lambda" etc. ``` The full documentation can be found in the [router documentation](https://www.apollographql.com/docs/router/configuration/authn-subgraph). By [@o0Ignition0o](https://github.com/o0Ignition0o) and [@BlenderDude](https://github.com/BlenderDude) in #3365 ### Helm: add init containers to deployment ([Issue #3248](#3248)) This is a new option when deploying the router, so that containers may be specified which execute before the Router container (or any extra Containers) begin executing. You can read more about [init containers](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) in the Kubernetes documentation. By [@laszlorostas](https://github.com/laszlorostas) in #3444 ### Helm: expose the `lifecycle` object on the router container ([Issue #3563](#3563)) You can now set the [Kubernetes `lifecycle` object](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/) on the router container in the helm chart. By [@bjoernw](https://github.com/bjoernw) in #3570 ## 🐛 Fixes ### Require the main (GraphQL) route to shutdown before other routes ([Issue #3521](#3521)) Router shutdown sequence has been improved to ensure that the main (GraphQL) route is shutdown before other routes are shutdown. Prior to this change all routes shut down in parallel and this would mean that, for example, health checks stopped responding prematurely. This was particularly undesirable when the router is executing in Kubernetes, since continuing to report live/ready checks during shutdown is a requirement. By [@garypen](https://github.com/garypen) in #3557 ### Spelling of `content_negociation` corrected to `content_negotiation` ([Issue #3204](#3204)) We had a bit of a French twist on one of our internal module names. We won't promise it won't happen again, but `content_negociation` is spelled as `content_negotiation` now. 😄 Thank you for this contribution! By [@krishna15898](https://github.com/krishna15898) in #3162 ### Fix Redis reconnections ([Issue #3045](#3045)) Redis reconnection policy was using an exponential backoff delay with a maximum number of attempts. Once that maximum is reached, reconnection was never tried again (there's no baseline retry). The router will now always retry with a maximum delay of 2 seconds, and a timeout of 1 millisecond, so that the router can continue serving requests in the meantime. This commit contains additional fixes: - Release the lock on the in-memory cache while waiting for Redis, to let the in memory cache serve other requests. - Add a custom serializer for the `SubSelectionKey` type. This type is used as key in a `HashMap` which is converted to a JSON object. Since object keys in JSON must be strings, a specific serializer was needed instead of the derived one. By [@Geal](https://github.com/Geal) in #3509 ### Close the subscription when a new supergraph becomes active ([Issue #3320](#3320)) Router schema/supergraph updates weren't resetting existing subscriptions which meant they could run with an out of date query plan. With this change, the router will signal clients that a `SUBSCRIPTION_SCHEMA_RELOAD` has occurred and close the running subscription. Clients will then subscribe again: ```json { "errors": [ { "message": "subscription has been closed due to a schema reload", "extensions": { "code": "SUBSCRIPTION_SCHEMA_RELOAD" } } ] } ``` By [@bnjjj](https://github.com/bnjjj) in #3341 ### Redis storage: return an error if a non serializable value is sent. ([#3594](#3594)) An error will now be logged when a value is unable to be serialized before being sent to the Redis storage backend. The message suggests opening an issue since this would be a router bug that we'd need to fix! By [@o0Ignition0o](https://github.com/o0Ignition0o) in #3597 ### Handle ping/pong websocket messages before the Ack message is received ([PR #3562](#3562)) Websocket servers will sometimes send Ping() messages before they Ack the connection initialization. This changeset allows the router to send Pong() messages, while still waiting until either `CONNECTION_ACK_TIMEOUT` elapsed, or the server successfully Acked the websocket connection start. By [@o0Ignition0o](https://github.com/o0Ignition0o) in #3562 ### Subscription requests only count in telemetry if the feature is actually enabled ([PR #3500](#3500)) Count subscription requests only if the feature is enabled. The router would previously count subscription requests regardless of whether the feature was enabled or not. This fix changes the behavior to only count subscription requests when the feature is enabled. By [@bnjjj](https://github.com/bnjjj) in #3500 ## 🛠 Maintenance ### Update `datadog-subgraph/`'s npm dependencies ([PR #3560](#3560)) This changeset updates the `dd-trace` dependency and the Node.js version of the example Dockerfile. By [@o0Ignition0o](https://github.com/o0Ignition0o) in #3560 ### Remove some `panic!` calls in persisted query logic ([PR #3527](#3527)) Replace a few `panic!` calls with `expect()`s in the persisted query code to improve clarity. By [@BrynCooke](https://github.com/BrynCooke) in #3527 ### Add a warning if we think `istio-proxy` injection is causing problems ([Issue #3533](#3533)) We have encountered situations where the injection of `istio-proxy` in a router pod (executing in Kubernetes) causes networking errors during [Apollo Uplink](https://www.apollographql.com/docs/federation/managed-federation/uplink/) communication. The situation isn't due to Apollo Uplink, but rather the router is executing and attempting to retrieve Apollo Uplink data while the `istio-proxy` is simultaneously modifying its network configuration. This new warning message directs users to information which should help them to configure their Kubernetes cluster or pod to avoid this problem. By [@garypen](https://github.com/garypen) in #3545 ### Log when custom plugins are detected with potentially-silenced log entries ([Issue #3526](#3526)) Since [PR #3477](#3477), users with custom plugins lost some log entries. This is because the default logging filter now restricts log entries to those that are in the `apollo` module, as originally intended. Users that have custom plugins need to configure the logging filter to include their modules, but may not have realised this. Now, if a custom plugin is detected, a message will be logged to the console indicating that the logging filter may need to be configured. By [@BrynCooke](https://github.com/BrynCooke) in #3540 ### Parent based sampling tests ([PR #3136](#3136)) This adds test for OpenTelemetry sampling defined either in the configuration or in headers carried by the request By [@Geal](https://github.com/Geal) in #3136 ## 📚 Documentation ### Redis URL format ([Issue #3534](#3534)) The Redis client used in the Router follows a convention on Redis server URLs to indicate TLS, cluster or sentinel usage By [@Geal](https://github.com/Geal) in #3556 ### Request lifecycle ([PR #3391](#3391)) This adds in-depth documentation of: - the entire request lifecycle - which services exist in the router - the request and response types they use - where plugins can attach themselves By [@Geal](https://github.com/Geal) [@Meschreiber](https://github.com/Meschreiber) in #3391 ### TLS termination and subgraph overrides ([Issue #3100](#3100)) TLS termination was added in [PR #2614](#2614) but never documented. Subgraph certificate override was added in [PR #2008](#2008) but the documentation missed some details on self-signed certificates. These have both been corrected! By [@Geal](https://github.com/Geal) in #3436 ### `self` is immutable in the `Plugin` trait's methods ([Issue #3539](#3539)) The documentation previously displayed `Plugin`'s methods as taking a mutable reference to `self`, while they actually take an _immutable_ reference to it. We've fixed the documentation. By [@Geal](https://github.com/Geal) in #3555 --------- Co-authored-by: Gary Pennington <gary@apollographql.com> Co-authored-by: Bryn Cooke <BrynCooke@gmail.com> Co-authored-by: Jesse Rosenberger <git@jro.cc> Co-authored-by: Chandrika Srinivasan <chandrikas@users.noreply.github.com>

garypen self-assigned this Aug 7, 2023

This comment has been minimized.

Sign in to view

garypen added 3 commits August 7, 2023 12:43

add some documentation for the problem

3c418ca

and a changeset

remember to include the changeset in the commit...

1db4a09

tidy up the doc entry a little

c18c6c7

garypen requested review from a team, Geal, SimonSapin, bnjjj and BrynCooke and removed request for bnjjj August 7, 2023 11:46

Merge branch 'dev' into garypen/3533-istio-warn

4f9e97b

garypen marked this pull request as ready for review August 7, 2023 11:50

garypen requested a review from a team as a code owner August 7, 2023 11:50

garypen added 2 commits August 7, 2023 14:10

Merge branch 'dev' into garypen/3533-istio-warn

fe499e2

Merge branch 'dev' into garypen/3533-istio-warn

feb0546

BrynCooke approved these changes Aug 8, 2023

View reviewed changes

Geal approved these changes Aug 8, 2023

View reviewed changes

Meschreiber approved these changes Aug 8, 2023

View reviewed changes

garypen and others added 5 commits August 9, 2023 08:35

Update .changesets/maint_garypen_3533_istio_warn.md

a84c91a

Co-authored-by: Maria Elisabeth Schreiber <maria.schreiber@apollographql.com>

Update .changesets/maint_garypen_3533_istio_warn.md

58d47ba

Co-authored-by: Maria Elisabeth Schreiber <maria.schreiber@apollographql.com>

Update .changesets/maint_garypen_3533_istio_warn.md

5011619

Co-authored-by: Maria Elisabeth Schreiber <maria.schreiber@apollographql.com>

Update docs/source/containerization/kubernetes.mdx

ec72e05

Co-authored-by: Maria Elisabeth Schreiber <maria.schreiber@apollographql.com>

Update docs/source/containerization/kubernetes.mdx

b075be3

Co-authored-by: Maria Elisabeth Schreiber <maria.schreiber@apollographql.com>

garypen merged commit 5852267 into dev Aug 9, 2023
2 checks passed

garypen deleted the garypen/3533-istio-warn branch August 9, 2023 09:21

This was referenced Aug 11, 2023

prep release: v1.27.0-alpha.0 #3564

Merged

prep release: v1.27.0 #3602

Merged

abernix mentioned this pull request Sep 4, 2023

Consider increasing default timeout for uplink #3241

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add a warning if we think istio-proxy injection is causing problems #3545

add a warning if we think istio-proxy injection is causing problems #3545

garypen commented Aug 7, 2023 •

edited

Loading

This comment has been minimized.

router-perf bot commented Aug 7, 2023

Meschreiber left a comment

add a warning if we think istio-proxy injection is causing problems #3545

add a warning if we think istio-proxy injection is causing problems #3545

Conversation

garypen commented Aug 7, 2023 • edited Loading

This comment has been minimized.

router-perf bot commented Aug 7, 2023

Meschreiber left a comment

Choose a reason for hiding this comment

garypen commented Aug 7, 2023 •

edited

Loading