fix: use Envoy's default for validate_clusters to fix breaking routes when some backend clusters don't exist #21587

Merged
merged 3 commits into from
Aug 20, 2024

Conversation

ndhanushkodi
Contributor

@ndhanushkodi ndhanushkodi commented Aug 5, 2024

Description

The validate_clusters option in Envoy's route configuration says:

"An optional boolean that specifies whether the clusters that the route table refers to will be validated by the cluster manager. If set to true and a route refers to a non-existent cluster, the route table will not load. If set to false and a route refers to a non-existent cluster, the route table will load and the router filter will return a 404 if the route is selected at runtime. This setting defaults to true if the route table is statically defined via the route_config option. This setting default to false if the route table is loaded dynamically via the rds option. Users may wish to override the default behavior in certain cases (for example when using CDS with a static route table)."

We load the route table dynamically via RDS, but override the default by explicitly setting validate_clusters to true. This means that when a cluster a route points to doesn't exist, the route can fail to reach any of its backends. This case can be triggered with a router -> resolver chain where the resolver has backends on different peers or WAN-federated backends, and you add a route to a backend that doesn't exist. The non-existent backend causes the existing backends to fail. I was not able to trigger this case in a single-cluster setup, but it can be triggered with a peered backend.

Because the traffic doesn't just blackhole but instead returns a 503, leaving validation off actually seems to be the desired behavior, rather than making all other routing paths within that route fail due to a missing cluster. This is similar to the conclusion that was reached within the Jira ticket.

This PR removes the code that overrides the default value of this validate_clusters option.

Testing & Reproduction steps

Links

PR Checklist

  • updated test coverage
  • external facing docs updated
  • appropriate backport labels added
  • not a security concern

@github-actions github-actions bot added the theme/envoy/xds Related to Envoy support label Aug 5, 2024
@ndhanushkodi ndhanushkodi added the backport/all Apply backports for all active releases per .release/versions.hcl label Aug 14, 2024
@ndhanushkodi ndhanushkodi force-pushed the nd/net-10435-cluster-validation branch 2 times, most recently from 66b128d to a761f2a Compare August 14, 2024 16:43
@ndhanushkodi ndhanushkodi added backport/1.19 Changes are backported to 1.19 and removed backport/all Apply backports for all active releases per .release/versions.hcl labels Aug 14, 2024
@ndhanushkodi ndhanushkodi changed the title don't validate clusters fix: use Envoy's default for validate_clusters to fix breaking routes when some backend clusters don't exist Aug 14, 2024
@ndhanushkodi ndhanushkodi force-pushed the nd/net-10435-cluster-validation branch from a761f2a to 631d61f Compare August 15, 2024 16:52
@ndhanushkodi ndhanushkodi marked this pull request as ready for review August 15, 2024 16:54
@ndhanushkodi ndhanushkodi force-pushed the nd/net-10435-cluster-validation branch 2 times, most recently from 1e0bec9 to 12f4daf Compare August 15, 2024 19:43
Member

@zalimeni zalimeni left a comment

Initial review LGTM! Once we introduce a flag I'm guessing we'll want to have a golden smoke test to prove it out but the general shape of these changes makes sense to me 👍🏻

zalimeni

This comment was marked as resolved.

@ndhanushkodi ndhanushkodi force-pushed the nd/net-10435-cluster-validation branch from 3fd3227 to ab15bc9 Compare August 18, 2024 05:54
@ndhanushkodi ndhanushkodi requested a review from a team as a code owner August 18, 2024 05:54
Member

@zalimeni zalimeni left a comment

Overall LGTM, just a few small comments! Approved to unblock

agent/structs/config_entry_mesh.go (outdated, resolved)
agent/xds/routes.go (outdated, resolved)
api/config_entry_mesh.go (outdated, resolved)
Member

@zalimeni zalimeni left a comment

Oh, one more question @ndhanushkodi, though happy to do this in a follow-up PR: I'm assuming we need some public docs updates as well? Just realized these changes are exclusively go/proto docs.

@ndhanushkodi ndhanushkodi requested a review from a team as a code owner August 19, 2024 22:11
@ndhanushkodi
Contributor Author

@zalimeni I originally thought to add docs as a followup but it was easy enough so I just went ahead and added here.

Contributor

@boruszak boruszak left a comment

Suggestion to match PR 9699.

Approving on behalf of consul-docs.

Comment on lines 349 to 354
`ValidateClusters is false by default and configures whether Envoy proxies will validate clusters in a route. If
set to true and any clusters in the route do not exist, the route table will not load. If set to false, the
route table will load and routing to a non-existent cluster will result in a 404. See
[Envoy docs](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route.proto#envoy-v3-api-field-config-route-v3-routeconfiguration-validate-clusters)
for more details. `,
Contributor

@boruszak boruszak Aug 19, 2024

Suggested change
`ValidateClusters is false by default and configures whether Envoy proxies will validate clusters in a route. If
set to true and any clusters in the route do not exist, the route table will not load. If set to false, the
route table will load and routing to a non-existent cluster will result in a 404. See
[Envoy docs](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route.proto#envoy-v3-api-field-config-route-v3-routeconfiguration-validate-clusters)
for more details. `,
`Controls whether the clusters the route table refers to are validated. The default value is false. When set to false and a route refers to a cluster that does not exist, the route table loads and routing to a non-existent cluster results in a 404. When set to true and a route refers to a cluster that does not exist, the route table will not load. For more information, refer to
[HTTP route configuration in the Envoy docs](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route.proto#envoy-v3-api-field-config-route-v3-routeconfiguration-validate-clusters). `,

Copying my suggestion from the other PR
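
As a rough illustration of the opt-in flag described in the doc string above, the following Go sketch shows one way a boolean setting could map onto an optional route field. It assumes the flag is only written when true and is otherwise left unset so that Envoy's own default for RDS applies; `meshConfig`, `routeConfiguration`, and `buildRoute` are hypothetical stand-ins and not the types or functions changed in this PR.

```go
package main

import "fmt"

// meshConfig is a hypothetical stand-in for the mesh config entry; only the
// field discussed in this thread is modeled.
type meshConfig struct {
	ValidateClusters bool
}

// routeConfiguration is a hypothetical stand-in for the xDS route resource.
// ValidateClusters is a pointer so that "unset" differs from "false".
type routeConfiguration struct {
	Name             string
	ValidateClusters *bool
}

// buildRoute sets the optional field only when the operator opted in.
// Otherwise it stays nil and the proxy's default for dynamically loaded
// route tables applies (assumed behavior for this sketch).
func buildRoute(name string, mesh *meshConfig) routeConfiguration {
	rc := routeConfiguration{Name: name}
	if mesh != nil && mesh.ValidateClusters {
		v := true
		rc.ValidateClusters = &v
	}
	return rc
}

func main() {
	optIn := buildRoute("web", &meshConfig{ValidateClusters: true})
	unset := buildRoute("web", &meshConfig{})

	fmt.Println("opt-in sets field:", optIn.ValidateClusters != nil && *optIn.ValidateClusters)
	fmt.Println("default leaves field unset:", unset.ValidateClusters == nil)
}
```

Using a pointer rather than a plain bool keeps "not configured" distinct from an explicit false, which is what lets the proxy fall back to its own default.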

@ndhanushkodi ndhanushkodi force-pushed the nd/net-10435-cluster-validation branch from 0a51677 to 16d59ad Compare August 19, 2024 23:31
@ndhanushkodi ndhanushkodi merged commit ed738a6 into main Aug 20, 2024
92 checks passed
@ndhanushkodi ndhanushkodi deleted the nd/net-10435-cluster-validation branch August 20, 2024 05:39
philrenaud pushed a commit that referenced this pull request Sep 12, 2024
Labels
backport/1.19 Changes are backported to 1.19 theme/envoy/xds Related to Envoy support

3 participants