Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A33: Client-Side Fault Injection #201

Merged
merged 13 commits into from
Feb 12, 2021
Merged

Conversation

lidizheng
Copy link
Contributor

@lidizheng lidizheng commented Aug 13, 2020


1. Fixed delay before forwarding the request;
1. Overridden gRPC status code;
1. Traffic matcher (upstream cluster name, downstream cluster name, and headers);
Copy link
Member

@dfawley dfawley Feb 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't any detail in here as to what is covered by this part of the feature.

In the envoy docs, I see:

  "upstream_cluster": "...",
  "headers": [],
  "downstream_nodes": [],

upstream_cluster makes sense - If set, we'd match on the cluster chosen by the config selector.

headers seems straightforward based on the envoy docs: "A match will happen if all the headers in the config are present in the request with the same values (or based on presence if the value field is not in the config)."

downstream_nodes, though, I'm not sure what to do with. Is this what you meant by "downstream cluster name"? For this it sounds like we would just match the value(s? - what if there are multiple) of the header "x-envoy-downstream-service-node" with the settings in "downstream_nodes" in the fault injection config.

But where does this header come from? It looks like it's set by envoy, not the user. Based on "the value is taken from the --service-node option" [ref]. We aren't setting this today as far as I'm aware, so does it make sense to support this feature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the traffic matcher, I updated the design doc that I think that only header-matching is common enough. Both the upstream cluster or downstream nodes features will postponed until users are asking for them.

The headers (or our metadata) can be set either by upstream Envoy or the application. I'm not sure how much will that make.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The headers (or our metadata) can be set either by upstream Envoy or the application.

The ones for the header matchers may be set by the application. But the one for the "downstream node" feature (which I now see we aren't supporting here) seems to be set only by Envoy, per the ref above. I'm not clear on what else this may be used for, and I am wondering if setting it is something we need to do in gRPC (or will need to do soon for other reasons).

Copy link
Contributor Author

@lidizheng lidizheng Feb 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To support this the downstream node matching, I think we will need to add this header injection as well. Technically, gRPC's XdsClient can add x-envoy-downstream-service-node or service-node header with it's Node info for all processed requests?

But, for this proposal, I hope to focus on the most common usages of the fault injection filter, which is delay, abort, match header, and max active faults.

A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
// delay feature is disabled and delay_fraction will be ignored.
google.protobuf.Duration delay = 1;

// The possibility that this RPC will be delayed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it say probability?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, probability is more accurate here. Updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even probability doesn't quite sound right in this context. What we are really talking about is, for example, out of 1 million RPCs delay will be introduced for these many RPCs (delay_fraction). If this is done on a random basis then probability applies to a given RPC being delayed because of this feature. However the number being described here is the "fraction" of RPCs (out of a million) that will be delayed by this feature.

A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this, Lidi. My comments here are mostly about overall organization and updates based on the HTTP filter design.

Please let me know if you have any questions. Thanks!

A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
Copy link
Contributor Author

@lidizheng lidizheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markdroth Thanks for the advices! I have visited all comments, but would like to discuss whether to keep mentioning retry or not in the comment threads.

A33-Fault-Injection.md Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good! Just a few comments remaining.

A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
A33-Fault-Injection.md Outdated Show resolved Hide resolved
@markdroth
Copy link
Member

This looks great! Just one more thing: Please add a small section about the env var guard, similar to what I added in the HTTP filter gRFC in 7eaa156.

@lidizheng
Copy link
Contributor Author

@markdroth The env guard section is added. It also mentioned A39 since this flag will flip both features on.

Envoy implements fault injection is an `HTTPFilter`, which can choose to operate on bytes or HTTP messages or events. It is unclear if gRPC can perform the precise same behavior, since gRPC generally operates on proto messages (only Core has visibility to bytes, not Java/Golang). More importantly, there is not enough consensus for if the timeout timer should start at the beginning of an RPC, or the receipt of the first bytes from peer, or after the entire message is received.


### Environment variable protection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be in the "Alternatives" section. Please move it up to the "Proposal" section, probably line 120.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Good catch!

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor change left, but otherwise this looks great! I'll go ahead and approve.

Thanks for doing this!

@@ -117,6 +117,11 @@ gRPC should accept fault injection settings with either HTTP status code or gRPC
According to the [Envoy implementation](https://github.com/envoyproxy/envoy/blob/main/source/extensions/filters/http/fault/fault_filter.cc#L222), each feature of fault injection is checked with a different random number. Compared to using a single random number for all features, this affects the probability of an RPC being affected by fault injection. For example, the fault injection has 20% chance to delay and 5% chance to abort. Under the single random number solution, only 20% RPC will be affected. But if two features are independent, 24% RPC will be affected. So, **this doc suggests to make each feature independent by using a different random number each time.**


### Environment Variable Protection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make changes similar to what I did in af7b012 to reflect the fact that this is just temporary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Using the same words as your commit.

@lidizheng
Copy link
Contributor Author

@dfawley @dapengzhang0 Added a section about the fault injection header percentage is capped by the effective fault injection filter config. Link to the implementation PR envoyproxy/envoy#10724

@lidizheng lidizheng merged commit 3b3c094 into grpc:master Feb 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants