Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Constrain APOLLO_ROUTER_LOG and --log global levels to the router #3477

Merged
merged 7 commits into from
Jul 21, 2023

Conversation

garypen
Copy link
Contributor

@garypen garypen commented Jul 20, 2023

It would be nice if users could specify just a log level and the router applied the required filtering to constrain the logging to the apollo_router module.

It would also be nice if, for advanced users, you could exercise the full power of a logging filter.

This PR enables both these use cases.

If you set a filter using RUST_LOG, it is used as is.

If you set it using APOLLO_ROUTER_LOG or --log, then any "global" scope levels are constrained to apollo_router.

Thus:

RUST_LOG=apollo_router=warn
--log warn
APOLLO_ROUTER_LOG=warn

are equivalent with all three statements resulting in warn level logging for the router.

For more details, read the logging configuration documentation.

fixes: #3474

Checklist

Complete the checklist (and note appropriate exceptions) before a final PR is raised.

  • Changes are compatible[^1]
  • Documentation[^2] completed
  • Performance impact assessed and acceptable
  • Tests added and passing[^3]
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

[^1]. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
[^2]. Configuration is an important part of many changes. Where applicable please try to document configuration examples.
[^3]. Tick whichever testing boxes are applicable. If you are adding Manual Tests:
- please document the manual testing (extensively) in the Exceptions.
- please raise a separate issue to automate the test and label it (or ask for it to be labeled) as manual test

It would be nice if users could specify just a log level and the router
applied the required filtering to constrain the logging to the
`apollo_router` module.

It would also be nice if, for advanced users, you could exercise the
full power of a logging filter.

This PR enables both these use cases.

If you set a filter using `RUST_LOG`, it is used as is.

If you set it using `APOLLO_ROUTER_LOG` or `--log`, then any "global"
scope levels are constrained to `apollo_router`.

Thus:

```
RUST_LOG=apollo_router=warn
--log warn
APOLLO_ROUTER_LOG=warn
```

are equivalent with all three statements resulting in `warn` level
logging for the router.

For more details, read the logging configuration documentation.

fixes: #3474
@garypen garypen self-assigned this Jul 20, 2023
@github-actions

This comment has been minimized.

@router-perf
Copy link

router-perf bot commented Jul 20, 2023

CI performance tests

  • xxlarge-request - Stress test with 100 MB request payload
  • step - Basic stress test that steps up the number of users over time
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • large-request - Stress test with a 1 MB request payload
  • const - Basic stress test that runs with a constant number of users
  • no-graphos - Basic stress test, no GraphOS.
  • reload - Reload test over a long period of time at a constant rate of users
  • xlarge-request - Stress test with 10 MB request payload
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity

@garypen garypen marked this pull request as ready for review July 20, 2023 11:21
@garypen garypen requested a review from a team as a code owner July 20, 2023 11:21
@garypen garypen requested review from bnjjj and BrynCooke July 20, 2023 11:21
Copy link
Contributor

@BrynCooke BrynCooke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with suggestion

Co-authored-by: Bryn Cooke <BrynCooke@gmail.com>
@garypen
Copy link
Contributor Author

garypen commented Jul 21, 2023

Approved with suggestion

I applied your suggestion and it lost some text. I'll fix it up.

@garypen garypen enabled auto-merge (squash) July 21, 2023 10:06
@garypen garypen disabled auto-merge July 21, 2023 11:02
@garypen garypen merged commit 0b49811 into dev Jul 21, 2023
2 checks passed
@garypen garypen deleted the garypen/3474-log-filter branch July 21, 2023 11:03
This was referenced Aug 11, 2023
o0Ignition0o added a commit that referenced this pull request Aug 18, 2023
> **Note**
>
> When approved, this PR will merge into **the `1.27.0` branch** which
will — upon being approved itself — merge into `main`.
>
> **Things to review in this PR**:
> - Changelog correctness (There is a preview below, but it is not
necessarily the most up to date. See the _Files Changed_ for the true
reality.)
>  - Version bumps
>  - That it targets the right release branch (`1.27.0` in this case!).
>
---
## 🚀 Features

### Add a metric tracking coprocessor latency ([Issue
#2924](#2924))

Introduces a new metric for the router:

```
apollo.router.operations.coprocessor.duration
```

It has one attribute:

```
coprocessor.stage: string (RouterRequest, RouterResponse, SubgraphRequest, SubgraphResponse)
```

It is a histogram metric tracking the time spent calling into the
coprocessor.

Note that the name of this metric may change in the future.

By [@Geal](https://github.com/Geal) in
#3513

### Configure AWS SigV4 authentication for subgraph requests ([PR
#3365](#3365))

Secure your router to subgraph communication on AWS using [Signature
Version
4](https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html)
(Sigv4)!
This changeset provides you with a way to set up hard-coded credentials,
as well as a default provider chain.
We recommend using the default provider chain configuration.

Full use example:

```yaml
    authentication:
      subgraph:
        all: # configuration that will apply to all subgraphs
          aws_sig_v4:
            default_chain:
              profile_name: "my-test-profile" # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#ec2-instance-profile
              region: "us-east-1" # https://docs.aws.amazon.com/general/latest/gr/rande.html
              service_name: "lambda" # https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_aws-services-that-work-with-iam.html
              assume_role: # https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html
                role_arn: "test-arn"
                session_name: "test-session"
                external_id: "test-id"
        subgraphs:
          products:
            aws_sig_v4:
              hardcoded: # Not recommended, prefer using default_chain as shown above
                access_key_id: "my-access-key"
                secret_access_key: "my-secret-access-key"
                region: "us-east-1"
                service_name: "vpc-lattice-svcs" # "s3", "lambda" etc.
```

The full documentation can be found in the [router
documentation](https://www.apollographql.com/docs/router/configuration/authn-subgraph).

By [@o0Ignition0o](https://github.com/o0Ignition0o) and
[@BlenderDude](https://github.com/BlenderDude) in
#3365

### Helm: add init containers to deployment ([Issue
#3248](#3248))

This is a new option when deploying the router, so that containers may
be specified which execute before the Router container (or any extra
Containers) begin executing. You can read more about [init
containers](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/)
in the Kubernetes documentation.

By [@laszlorostas](https://github.com/laszlorostas) in
#3444

### Helm: expose the `lifecycle` object on the router container ([Issue
#3563](#3563))

You can now set the [Kubernetes `lifecycle`
object](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/)
on the router container in the helm chart.

By [@bjoernw](https://github.com/bjoernw) in
#3570

## 🐛 Fixes

### Require the main (GraphQL) route to shutdown before other routes
([Issue #3521](#3521))

Router shutdown sequence has been improved to ensure that the main
(GraphQL) route is shutdown before other routes are shutdown. Prior to
this change all routes shut down in parallel and this would mean that,
for example, health checks stopped responding prematurely.

This was particularly undesirable when the router is executing in
Kubernetes, since continuing to report live/ready checks during shutdown
is a requirement.

By [@garypen](https://github.com/garypen) in
#3557

### Spelling of `content_negociation` corrected to `content_negotiation`
([Issue #3204](#3204))

We had a bit of a French twist on one of our internal module names. We
won't promise it won't happen again, but `content_negociation` is
spelled as `content_negotiation` now. 😄

Thank you for this contribution!

By [@krishna15898](https://github.com/krishna15898) in
#3162

### Fix Redis reconnections ([Issue
#3045](#3045))

Redis reconnection policy was using an exponential backoff delay with a
maximum number of attempts. Once that maximum is reached, reconnection
was never tried again (there's no baseline retry). The router will now
always retry with a maximum delay of 2 seconds, and a timeout of 1
millisecond, so that the router can continue serving requests in the
meantime.

This commit contains additional fixes:
- Release the lock on the in-memory cache while waiting for Redis, to
let the in memory cache serve other requests.
- Add a custom serializer for the `SubSelectionKey` type. This type is
used as key in a `HashMap` which is converted to a JSON object. Since
object keys in JSON must be strings, a specific serializer was needed
instead of the derived one.

By [@Geal](https://github.com/Geal) in
#3509

### Close the subscription when a new supergraph becomes active ([Issue
#3320](#3320))

Router schema/supergraph updates weren't resetting existing
subscriptions which meant they could run with an out of date query plan.

With this change, the router will signal clients that a
`SUBSCRIPTION_SCHEMA_RELOAD` has occurred and close the running
subscription. Clients will then subscribe again:


```json
{
  "errors": [
    {
      "message": "subscription has been closed due to a schema reload",
      "extensions": {
        "code": "SUBSCRIPTION_SCHEMA_RELOAD"
      }
    }
  ]
}
```


By [@bnjjj](https://github.com/bnjjj) in
#3341

### Redis storage: return an error if a non serializable value is sent.
([#3594](#3594))

An error will now be logged when a value is unable to be serialized
before being sent to the Redis storage backend. The message suggests
opening an issue since this would be a router bug that we'd need to fix!

By [@o0Ignition0o](https://github.com/o0Ignition0o) in
#3597

### Handle ping/pong websocket messages before the Ack message is
received ([PR #3562](#3562))

Websocket servers will sometimes send Ping() messages before they Ack
the connection initialization. This changeset allows the router to send
Pong() messages, while still waiting until either
`CONNECTION_ACK_TIMEOUT` elapsed, or the server successfully Acked the
websocket connection start.

By [@o0Ignition0o](https://github.com/o0Ignition0o) in
#3562

### Subscription requests only count in telemetry if the feature is
actually enabled ([PR
#3500](#3500))

Count subscription requests only if the feature is enabled.

The router would previously count subscription requests regardless of
whether the feature was enabled or not. This fix changes the behavior to
only count subscription requests when the feature is enabled.

By [@bnjjj](https://github.com/bnjjj) in
#3500

## 🛠 Maintenance

### Update `datadog-subgraph/`'s npm dependencies ([PR
#3560](#3560))

This changeset updates the `dd-trace` dependency and the Node.js version
of the example Dockerfile.

By [@o0Ignition0o](https://github.com/o0Ignition0o) in
#3560

### Remove some `panic!` calls in persisted query logic ([PR
#3527](#3527))

Replace a few `panic!` calls with `expect()`s in the persisted query
code to improve clarity.

By [@BrynCooke](https://github.com/BrynCooke) in
#3527

### Add a warning if we think `istio-proxy` injection is causing
problems ([Issue
#3533](#3533))

We have encountered situations where the injection of `istio-proxy` in a
router pod (executing in Kubernetes) causes networking errors during
[Apollo
Uplink](https://www.apollographql.com/docs/federation/managed-federation/uplink/)
communication.

The situation isn't due to Apollo Uplink, but rather the router is
executing and attempting to retrieve Apollo Uplink data while the
`istio-proxy` is simultaneously modifying its network configuration.

This new warning message directs users to information which should help
them to configure their Kubernetes cluster or pod to avoid this problem.

By [@garypen](https://github.com/garypen) in
#3545

### Log when custom plugins are detected with potentially-silenced log
entries ([Issue
#3526](#3526))

Since [PR #3477](#3477),
users with custom plugins lost some log entries. This is because the
default logging filter now restricts log entries to those that are in
the `apollo` module, as originally intended.

Users that have custom plugins need to configure the logging filter to
include their modules, but may not have realised this.

Now, if a custom plugin is detected, a message will be logged to the
console indicating that the logging filter may need to be configured.

By [@BrynCooke](https://github.com/BrynCooke) in
#3540

### Parent based sampling tests ([PR
#3136](#3136))

This adds test for OpenTelemetry sampling defined either in the
configuration or in headers carried by the request

By [@Geal](https://github.com/Geal) in
#3136

## 📚 Documentation

### Redis URL format ([Issue
#3534](#3534))

The Redis client used in the Router follows a convention on Redis server
URLs to indicate TLS, cluster or sentinel usage


By [@Geal](https://github.com/Geal) in
#3556

### Request lifecycle ([PR
#3391](#3391))

This adds in-depth documentation of:
- the entire request lifecycle
- which services exist in the router
- the request and response types they use
- where plugins can attach themselves

By [@Geal](https://github.com/Geal)
[@Meschreiber](https://github.com/Meschreiber) in
#3391

### TLS termination and subgraph overrides ([Issue
#3100](#3100))

TLS termination was added in [PR
#2614](#2614) but never
documented. Subgraph certificate override was added in [PR
#2008](#2008) but the
documentation missed some details on self-signed certificates. These
have both been corrected!

By [@Geal](https://github.com/Geal) in
#3436

### `self` is immutable in the `Plugin` trait's methods ([Issue
#3539](#3539))

The documentation previously displayed `Plugin`'s methods as taking a
mutable reference to `self`, while they actually take an _immutable_
reference to it.

We've fixed the documentation.

By [@Geal](https://github.com/Geal) in
#3555

---------

Co-authored-by: Gary Pennington <gary@apollographql.com>
Co-authored-by: Bryn Cooke <BrynCooke@gmail.com>
Co-authored-by: Jesse Rosenberger <git@jro.cc>
Co-authored-by: Chandrika Srinivasan <chandrikas@users.noreply.github.com>
@Meemaw
Copy link
Contributor

Meemaw commented Aug 22, 2023

@garypen would this also affect traces/spans? For example, could this change lead to spans instrumented in custom plugin to not show up?

@garypen
Copy link
Contributor Author

garypen commented Aug 22, 2023

@Meemaw Yes. If you have a custom plugin, it's better to use RUST_LOG to set trace/logging levels. APOLLO_ROUTER_LOG is a deliberately simplified environment variable designed to help with most common use cases for logging/tracing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Loglevel Trace causes Router to crash
4 participants