-
Notifications
You must be signed in to change notification settings - Fork 238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A33: Client-Side Fault Injection #201
Conversation
A33-Fault-Injection.md
Outdated
|
||
1. Fixed delay before forwarding the request; | ||
1. Overridden gRPC status code; | ||
1. Traffic matcher (upstream cluster name, downstream cluster name, and headers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't any detail in here as to what is covered by this part of the feature.
In the envoy docs, I see:
"upstream_cluster": "...",
"headers": [],
"downstream_nodes": [],
upstream_cluster
makes sense - If set, we'd match on the cluster chosen by the config selector.
headers
seems straightforward based on the envoy docs: "A match will happen if all the headers in the config are present in the request with the same values (or based on presence if the value field is not in the config)."
downstream_nodes
, though, I'm not sure what to do with. Is this what you meant by "downstream cluster name"? For this it sounds like we would just match the value(s? - what if there are multiple) of the header "x-envoy-downstream-service-node" with the settings in "downstream_nodes" in the fault injection config.
But where does this header come from? It looks like it's set by envoy, not the user. Based on "the value is taken from the --service-node
option" [ref]. We aren't setting this today as far as I'm aware, so does it make sense to support this feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the traffic matcher, I updated the design doc that I think that only header-matching is common enough. Both the upstream cluster or downstream nodes features will postponed until users are asking for them.
The headers (or our metadata) can be set either by upstream Envoy or the application. I'm not sure how much will that make.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The headers (or our metadata) can be set either by upstream Envoy or the application.
The ones for the header matchers may be set by the application. But the one for the "downstream node" feature (which I now see we aren't supporting here) seems to be set only by Envoy, per the ref above. I'm not clear on what else this may be used for, and I am wondering if setting it is something we need to do in gRPC (or will need to do soon for other reasons).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To support this the downstream node matching, I think we will need to add this header injection as well. Technically, gRPC's XdsClient can add x-envoy-downstream-service-node
or service-node
header with it's Node info for all processed requests?
But, for this proposal, I hope to focus on the most common usages of the fault injection filter, which is delay, abort, match header, and max active faults.
A33-Fault-Injection.md
Outdated
// delay feature is disabled and delay_fraction will be ignored. | ||
google.protobuf.Duration delay = 1; | ||
|
||
// The possibility that this RPC will be delayed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it say probability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, probability is more accurate here. Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even probability doesn't quite sound right in this context. What we are really talking about is, for example, out of 1 million RPCs delay will be introduced for these many RPCs (delay_fraction
). If this is done on a random basis then probability applies to a given RPC being delayed because of this feature. However the number being described here is the "fraction" of RPCs (out of a million) that will be delayed by this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this, Lidi. My comments here are mostly about overall organization and updates based on the HTTP filter design.
Please let me know if you have any questions. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markdroth Thanks for the advices! I have visited all comments, but would like to discuss whether to keep mentioning retry or not in the comment threads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good! Just a few comments remaining.
This looks great! Just one more thing: Please add a small section about the env var guard, similar to what I added in the HTTP filter gRFC in 7eaa156. |
@markdroth The env guard section is added. It also mentioned A39 since this flag will flip both features on. |
A33-Fault-Injection.md
Outdated
Envoy implements fault injection is an `HTTPFilter`, which can choose to operate on bytes or HTTP messages or events. It is unclear if gRPC can perform the precise same behavior, since gRPC generally operates on proto messages (only Core has visibility to bytes, not Java/Golang). More importantly, there is not enough consensus for if the timeout timer should start at the beginning of an RPC, or the receipt of the first bytes from peer, or after the entire message is received. | ||
|
||
|
||
### Environment variable protection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be in the "Alternatives" section. Please move it up to the "Proposal" section, probably line 120.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor change left, but otherwise this looks great! I'll go ahead and approve.
Thanks for doing this!
A33-Fault-Injection.md
Outdated
@@ -117,6 +117,11 @@ gRPC should accept fault injection settings with either HTTP status code or gRPC | |||
According to the [Envoy implementation](https://github.com/envoyproxy/envoy/blob/main/source/extensions/filters/http/fault/fault_filter.cc#L222), each feature of fault injection is checked with a different random number. Compared to using a single random number for all features, this affects the probability of an RPC being affected by fault injection. For example, the fault injection has 20% chance to delay and 5% chance to abort. Under the single random number solution, only 20% RPC will be affected. But if two features are independent, 24% RPC will be affected. So, **this doc suggests to make each feature independent by using a different random number each time.** | |||
|
|||
|
|||
### Environment Variable Protection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make changes similar to what I did in af7b012 to reflect the fact that this is just temporary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Using the same words as your commit.
@dfawley @dapengzhang0 Added a section about the fault injection header percentage is capped by the effective fault injection filter config. Link to the implementation PR envoyproxy/envoy#10724 |
gRFC: https://github.com/lidizheng/proposal/blob/fault-injection/A33-Fault-Injection.md