A33: Client-Side Fault Injection #201

lidizheng · 2020-08-13T19:40:34Z

gRFC: https://github.com/lidizheng/proposal/blob/fault-injection/A33-Fault-Injection.md

dfawley · 2021-02-02T23:38:23Z

A33-Fault-Injection.md

+
+1. Fixed delay before forwarding the request;
+1. Overridden gRPC status code;
+1. Traffic matcher (upstream cluster name, downstream cluster name, and headers);


There isn't any detail in here as to what is covered by this part of the feature.

In the envoy docs, I see:

"upstream_cluster": "...", "headers": [], "downstream_nodes": [],

upstream_cluster makes sense - If set, we'd match on the cluster chosen by the config selector.

headers seems straightforward based on the envoy docs: "A match will happen if all the headers in the config are present in the request with the same values (or based on presence if the value field is not in the config)."

downstream_nodes, though, I'm not sure what to do with. Is this what you meant by "downstream cluster name"? For this it sounds like we would just match the value(s? - what if there are multiple) of the header "x-envoy-downstream-service-node" with the settings in "downstream_nodes" in the fault injection config.

But where does this header come from? It looks like it's set by envoy, not the user. Based on "the value is taken from the --service-node option" [ref]. We aren't setting this today as far as I'm aware, so does it make sense to support this feature?

For the traffic matcher, I updated the design doc that I think that only header-matching is common enough. Both the upstream cluster or downstream nodes features will postponed until users are asking for them.

The headers (or our metadata) can be set either by upstream Envoy or the application. I'm not sure how much will that make.

The headers (or our metadata) can be set either by upstream Envoy or the application.

The ones for the header matchers may be set by the application. But the one for the "downstream node" feature (which I now see we aren't supporting here) seems to be set only by Envoy, per the ref above. I'm not clear on what else this may be used for, and I am wondering if setting it is something we need to do in gRPC (or will need to do soon for other reasons).

To support this the downstream node matching, I think we will need to add this header injection as well. Technically, gRPC's XdsClient can add x-envoy-downstream-service-node or service-node header with it's Node info for all processed requests?

But, for this proposal, I hope to focus on the most common usages of the fault injection filter, which is delay, abort, match header, and max active faults.

A33-Fault-Injection.md

sanjaypujare · 2021-02-04T19:57:06Z

A33-Fault-Injection.md

+   // delay feature is disabled and delay_fraction will be ignored.
+   google.protobuf.Duration delay = 1;
+
+   // The possibility that this RPC will be delayed.


Should it say probability?

You are right, probability is more accurate here. Updated.

Even probability doesn't quite sound right in this context. What we are really talking about is, for example, out of 1 million RPCs delay will be introduced for these many RPCs (delay_fraction). If this is done on a random basis then probability applies to a given RPC being delayed because of this feature. However the number being described here is the "fraction" of RPCs (out of a million) that will be delayed by this feature.

A33-Fault-Injection.md

markdroth

Thanks for doing this, Lidi. My comments here are mostly about overall organization and updates based on the HTTP filter design.

Please let me know if you have any questions. Thanks!

A33-Fault-Injection.md

lidizheng

@markdroth Thanks for the advices! I have visited all comments, but would like to discuss whether to keep mentioning retry or not in the comment threads.

A33-Fault-Injection.md

markdroth

This looks pretty good! Just a few comments remaining.

A33-Fault-Injection.md

markdroth · 2021-02-10T18:01:52Z

This looks great! Just one more thing: Please add a small section about the env var guard, similar to what I added in the HTTP filter gRFC in 7eaa156.

lidizheng · 2021-02-10T18:38:42Z

@markdroth The env guard section is added. It also mentioned A39 since this flag will flip both features on.

markdroth · 2021-02-10T18:55:15Z

A33-Fault-Injection.md

+Envoy implements fault injection is an `HTTPFilter`, which can choose to operate on bytes or HTTP messages or events. It is unclear if gRPC can perform the precise same behavior, since gRPC generally operates on proto messages (only Core has visibility to bytes, not Java/Golang). More importantly, there is not enough consensus for if the timeout timer should start at the beginning of an RPC, or the receipt of the first bytes from peer, or after the entire message is received.
+
+
+### Environment variable protection


This shouldn't be in the "Alternatives" section. Please move it up to the "Proposal" section, probably line 120.

Updated. Good catch!

markdroth

One minor change left, but otherwise this looks great! I'll go ahead and approve.

Thanks for doing this!

markdroth · 2021-02-10T19:16:23Z

A33-Fault-Injection.md

@@ -117,6 +117,11 @@ gRPC should accept fault injection settings with either HTTP status code or gRPC
 According to the [Envoy implementation](https://github.com/envoyproxy/envoy/blob/main/source/extensions/filters/http/fault/fault_filter.cc#L222), each feature of fault injection is checked with a different random number. Compared to using a single random number for all features, this affects the probability of an RPC being affected by fault injection. For example, the fault injection has 20% chance to delay and 5% chance to abort. Under the single random number solution, only 20% RPC will be affected. But if two features are independent, 24% RPC will be affected. So, **this doc suggests to make each feature independent by using a different random number each time.**


+### Environment Variable Protection


Please make changes similar to what I did in af7b012 to reflect the fact that this is just temporary.

Done. Using the same words as your commit.

lidizheng · 2021-02-12T21:07:19Z

@dfawley @dapengzhang0 Added a section about the fault injection header percentage is capped by the effective fault injection filter config. Link to the implementation PR envoyproxy/envoy#10724

lidizheng added 2 commits August 13, 2020 12:38

Add gRPC A33: Client-Side Fault Injection

b9b6e47

Add grpc-io link

443cb4a

lidizheng marked this pull request as ready for review August 13, 2020 19:46

lidizheng requested review from dfawley, ejona86 and markdroth August 13, 2020 19:47

dapengzhang0 mentioned this pull request Jan 25, 2021

xds: parse HttpFault filter from LDS/RDS response grpc/grpc-java#7841

Merged

dfawley reviewed Feb 2, 2021

View reviewed changes

Clearify we only want to support the traffic matcher based on headers

c242ad7

markdroth reviewed Feb 3, 2021

View reviewed changes

A33-Fault-Injection.md Show resolved Hide resolved

Refer to A39 for HTTP filter processing

fd9aac0

sanjaypujare reviewed Feb 4, 2021

View reviewed changes