Add FULL_SCAN selection mode to least request LB #31507
Conversation
Hi @jkirschner-hashicorp, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already.
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to
@wbpcode , @tonya11en : I realize I'll need to resolve the changelogs conflict and the commit sign-off. This is otherwise ready for your feedback.
google.protobuf.BoolValue enable_full_scan = 5;
// Method for selecting the host set from which to return the host with the fewest active requests.
// Defaults to POWER_OF_N_CHOICES, where N is determined by `choice_count` (2 by default).
SelectionMethod selection_method = 5 [(validate.rules).enum = {defined_only: true}];
Are there any issues with modifying the meaning of field 5 when it was never included in an Envoy release (AFAIK)?
Yes, this is a problem. The xDS protos are not used exclusively for Envoy, and they do get picked up by other data planes (e.g., gRPC) in between Envoy releases.
Please leave the existing field as-is, although you can mark it as deprecated if you want. The new field should be added with a new number, and you can document that the old field is ignored if the new field is set.
// The default number of choices for the LEAST_REQUEST policy is 2 hosts, if the number of hosts
// is equal to the number of choices, a full scan happens instead, this means that the same host
// will be chosen.
if (is_hash_lb_ || (GetParam() == envoy::config::cluster::v3::Cluster::LEAST_REQUEST)) {
Note to self: check whether this patch from the original enable full scan PR is actually still needed
Removing. No longer applicable, since full scan mode will require opt-in in all cases
Please fix DCO.
Thanks for doing this!
Lots of comments, but only the comments about not changing existing LB behavior are important. The LB's host selection behavior should only change if the FULL_SCAN method is set. The original full scan PR had to be reverted because it did this (see #31146 (review)).
// Available methods for selecting the host set from which to return the host with the
// fewest active requests.
enum SelectionMethod {
  POWER_OF_N_CHOICES = 0;
I'd call this something like N_CHOICE. The "power of" phrasing just comes from the title of one of the academic papers discussing this.
// The number of random healthy hosts from which the host with the fewest active requests will
// be chosen. Defaults to 2 so that we perform two-choice selection if the field is not set.
// Ignored if ``selection_method == FULL_SCAN``.
For future-proofing this comment, phrase it something like:
Only applies to the N_CHOICE selection method.
Just in case another selection method shows up.
message LeastRequest {
  // Available methods for selecting the host set from which to return the host with the
  // fewest active requests.
  enum SelectionMethod {
Can you add some 1-liners explaining what it means to set these? I don't think it's explained anywhere what a FULL_SCAN method means.
Can you please add some guidance/recommendation on when full scan should be used?
@barroca : Can you comment on your use case for full scan mode? I know it was different than mine. I'd like to mention both if applicable.
I didn't have any use case for it :) I picked up the work since it looked simple enough to be a learning opportunity for contributing to the project.
The use case I was trying to address is when the number of hosts is very small and there is a probability that the random host selection will either pick the same host repeatedly or miss the host with the fewest requests, leaving the load unbalanced.
But this mostly matters when we are dealing with a small number of requests, because at high throughput the probability of repeating the same random selection decreases and requests balance out between hosts (according to the paper behind the original algorithm).
The second implementation, instead of doing a full scan from the same index in the host list, randomised the start index to further reduce the chance of choosing the same host when the request counts are equal.
I guess the main use case of a full scan is to guarantee that the request is sent to the host with the fewest requests (which can be beneficial for some use cases, like evenly splitting work in a map-reduce job).
Just added a line that explains N_CHOICES is best for most scenarios, and also explained the niche scenarios in which FULL_SCAN may be preferable. Let me know what you think, @tonya11en and @ramaraochavali. (I think this is the last outstanding review comment)
thank you
// used to select the one with least requests instead of using random choices.
google.protobuf.BoolValue enable_full_scan = 5;
// Method for selecting the host set from which to return the host with the fewest active requests.
// Defaults to POWER_OF_N_CHOICES, where N is determined by `choice_count` (2 by default).
Let's not say what choice_count defaults to, since that's documented on the field itself. If the default changes, someone would have to change this comment too.
I'd also move the explanation of what POWER_OF_N_CHOICES means to the enum definition.
changelogs/current.yaml
Outdated
- area: upstream
  change: |
    Fixed a reported issue (https://github.com/envoyproxy/envoy/issues/11004) that causes the Least
    Request load balancer policy to be unfair when the number of hosts are very small, when the number
    of hosts is smaller than the choice_count, instead of randomly selection hosts from the list, we
    perform a full scan on it to choose the host with least requests.
I don't think this is supposed to be included?
    envoy::extensions::load_balancing_policies::least_request::v3::LeastRequest_SelectionMethod_FULL_SCAN;
if (has_full_scan_configured || (hosts_to_use.size() <= choice_count_)) {
  // Choose a random index to start from preventing always picking the first host in the list.
  const int rand_idx = random_.random() % hosts_to_use.size();
  for (unsigned long i = 0; i < hosts_to_use.size(); i++) {
    const HostSharedPtr& sampled_host = hosts_to_use[(rand_idx + i) % hosts_to_use.size()];
We don't want to change the default behavior of the LB if FULL_SCAN is not set, so I'd remove the part that checks hosts_to_use.size() <= choice_count_. Otherwise, this will break things for a lot of folks that depend on the subtle differences in selection probabilities when using the n-choice selection.
I think I go into the details in #31146 (review).
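To make the suggestion concrete, here is a minimal sketch of the gating being asked for, reusing the names from the quoted diff (the full-scan and n-choice bodies are elided, and the surrounding function is assumed):
// Sketch only: full scan runs solely when it is explicitly configured, so the default
// n-choice behavior is untouched even when hosts_to_use.size() <= choice_count_.
const bool use_full_scan =
    selection_method_ ==
    envoy::extensions::load_balancing_policies::least_request::v3::LeastRequest_SelectionMethod_FULL_SCAN;
if (use_full_scan) {
  // ... scan every host ...
} else {
  // ... existing power-of-N-choices sampling ...
}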
// - the number of choices is equal to or larger than the number of hosts.
const auto has_full_scan_configured = selection_method_ ==
    envoy::extensions::load_balancing_policies::least_request::v3::LeastRequest_SelectionMethod_FULL_SCAN;
if (has_full_scan_configured || (hosts_to_use.size() <= choice_count_)) {
Can you pull this block out into its own function? It would probably look nice if you had a switch statement that calls the appropriate selection function.
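As a sketch of what that refactor might look like (the helper names unweightedHostPickFullScan and unweightedHostPickNChoices are made up for illustration, and the signature is approximate; none of this is from the PR):
HostSharedPtr LeastRequestLoadBalancer::unweightedHostPick(const HostVector& hosts_to_use) {
  switch (selection_method_) {
  case envoy::extensions::load_balancing_policies::least_request::v3::
      LeastRequest_SelectionMethod_FULL_SCAN:
    // Scan every host and return one with the fewest active requests.
    return unweightedHostPickFullScan(hosts_to_use);
  default:
    // Sample choice_count_ hosts at random and keep the least loaded one.
    return unweightedHostPickNChoices(hosts_to_use);
  }
}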
// Host weight is 1.
{
  EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2)).WillOnce(Return(3));
  EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2));
  EXPECT_EQ(hostSet().healthy_hosts_[0], lb_.chooseHost(nullptr));
}

// Host weight is 100.
{
  EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2)).WillOnce(Return(3));
  EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2));
  EXPECT_EQ(hostSet().healthy_hosts_[0], lb_.chooseHost(nullptr));
}
Based on my comments about only changing behavior when FULL_SCAN is set, I don't think you need to modify any existing unit tests. All that should be needed is a new one that sets the new selection method and verifies sane behavior.
You might even be able to get away with unit testing just the selection method if it's in its own function.
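For illustration, a focused test along those lines might look like the sketch below (the fixture name, config plumbing, and stat-setting calls are assumptions based on the snippets quoted in this review, not the PR's actual test code):
TEST_F(LeastRequestLoadBalancerTest, FullScanPicksHostWithFewestActiveRequests) {
  // Opt in to the new selection method.
  config_.set_selection_method(
      envoy::extensions::load_balancing_policies::least_request::v3::LeastRequest::FULL_SCAN);
  init();

  // Host 2 has strictly fewer active requests than the others.
  hostSet().healthy_hosts_[0]->stats().rq_active_.set(3);
  hostSet().healthy_hosts_[1]->stats().rq_active_.set(3);
  hostSet().healthy_hosts_[2]->stats().rq_active_.set(1);

  // Whatever start index the RNG produces, the full scan must land on host 2.
  EXPECT_CALL(random_, random()).WillRepeatedly(Return(1));
  EXPECT_EQ(hostSet().healthy_hosts_[2], lb_.chooseHost(nullptr));
}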
To fix the DCO check, you'll need all your commits to be made via
// Choose a random index to start from preventing always picking the first host in the list.
const int rand_idx = random_.random() % hosts_to_use.size();
This is still going to be problematic as far as selection probabilities go. Consider some host vector with the following weights:
[9, 9, 1, 1, 1]
Choosing a random index to start the scan from would still choose the first host 80% of the time. The host at index 2 will only be picked 20% of the time, which seems unintuitive.
Can you walk me through why that happens?
I thought hosts_to_use.size() would be 5, thinking it was just a std::vector. Your comment makes me think something else is happening, like hosts_to_use.size() returns a sum of weights? But presumably the array is still of size 5, so I shouldn't be able to index past [4] in the for-loop that iterates through hosts. I feel like I'm missing something.
Sorry, the issue is with the active request counts, not the weights. The problem is the same though; see below.
So, suppose you have active request counts [9, 9, 9, 1, 1]. When picking a host, you're choosing a random index and full scanning through all 5 hosts. This implementation is keeping the host with the lowest active request count, so if there's a tie between hosts it will still choose the first one it finds. In this example, the only time you'd choose the host at index 4 is if your random index lands on it first. Otherwise, you'd choose the host at index 3, since 80% of the time it will be the first host with the lowest active rq count. Does that make sense?
I applied your patch and hacked together a simulation to show what I mean. For the example above, here are results:
n=1000000
: weight=1 active_rqs=1, pick_ratio=0.199839
: weight=1 active_rqs=1, pick_ratio=0.800161
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
I ran this for a few different combinations of active request counts and the results are consistent with my explanation above. Dig it:
[ RUN ] LeastRequestLoadBalancerWeightTest.TallenHax
================================
n=1000000
: weight=1 active_rqs=1, pick_ratio=0.200265
: weight=1 active_rqs=1, pick_ratio=0.200148
: weight=1 active_rqs=1, pick_ratio=0.199813
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=1, pick_ratio=0.399774
================================
n=1000000
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=1, pick_ratio=0.199856
: weight=1 active_rqs=1, pick_ratio=0.600027
: weight=1 active_rqs=1, pick_ratio=0.200117
: weight=1 active_rqs=9, pick_ratio=0
================================
n=1000000
: weight=1 active_rqs=1, pick_ratio=0.199839
: weight=1 active_rqs=1, pick_ratio=0.800161
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
================================
n=1000000
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=1, pick_ratio=1
: weight=1 active_rqs=9, pick_ratio=0
================================
n=1000000
: weight=1 active_rqs=1, pick_ratio=0.200368
: weight=1 active_rqs=1, pick_ratio=0.199323
: weight=1 active_rqs=1, pick_ratio=0.200177
: weight=1 active_rqs=1, pick_ratio=0.199968
: weight=1 active_rqs=1, pick_ratio=0.200164
It can be easily fixed, but it's gonna make the implementation a bit less performant.
Ah, that makes sense. If there's a tie for "least requests", then hosts with more non-least requests before them in the array will get more selection weight.
It can be easily fixed, but it's gonna make the implementation a bit less performant.
I haven't thought about the best way to easily fix this yet. If you have a suggestion in mind, I'm open to it.
Here's some pseudo-code:
std::vector<HostPtr> candidates;
candidates.reserve(hosts_to_use.size());
for (const auto& host : hosts_to_use) {
  const bool reset_candidates =
      candidates.empty() || (host->active_rqs < candidates.front()->active_rqs);
  if (reset_candidates) {
    // Found a new minimum active request count.
    candidates.clear();
    candidates.push_back(host);
  } else if (host->active_rqs == candidates.front()->active_rqs) {
    // Found another host with the minimum number of active requests.
    candidates.push_back(host);
  }
}
// rand_idx is the random value chosen earlier in the patch.
return candidates[rand_idx % candidates.size()];
Basically just find all the hosts that share the smallest number of active requests and choose one of them at random. There's probably some fancy modern std::algorithm function that does this in a more elegant way, but you get the idea.
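For completeness, roughly the same idea expressed with the standard library (still only a sketch; active_rqs is a stand-in for however the active request count is actually read, as in the pseudo-code above; needs <algorithm>, <iterator>, and <vector>):
// Find the minimum active request count, gather every host that ties it,
// then pick one of the tied hosts uniformly at random.
const auto least_loaded = std::min_element(
    hosts_to_use.begin(), hosts_to_use.end(),
    [](const auto& a, const auto& b) { return a->active_rqs < b->active_rqs; });
std::vector<HostPtr> candidates;
std::copy_if(hosts_to_use.begin(), hosts_to_use.end(), std::back_inserter(candidates),
             [&](const auto& h) { return h->active_rqs == (*least_loaded)->active_rqs; });
return candidates[random_.random() % candidates.size()];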
Actually, I have to change the random start index to be decoupled from the random number used to select the prime. I'll ping you again here when that's done.
Done - decoupled the random number used to select a large prime from the random number used to select the pre-increment array index.
Can't give this the attention it deserves today, but I'll take a look and get back to you by Tuesday (because of the holiday).
That's totally fine - thanks for expectation setting :) Enjoy the time away! We'll re-engage sometime next week.
Keep in mind, this approach will always perform somewhere between the "random start index, iterate by 1" approach and the pseudo-code you wrote above in terms of following the pick ratios. In most cases, I think it will get close enough to the ideal that your pseudo-code achieves. However, there will be some cases where it looks closer to the "random start index, iterate by 1" approach.
The worst case that comes to mind is with 6 hosts that have active request counts of [9, 9, 9, 1, 1, 1]. A set of 6 hosts has only two possible increment values (1 and 5 are the only coprimes with 6 that are <6), and 6 is the highest count with only 2 possible increment values (which happen to be the +1 and -1 increments). In that case, the pick ratios will be:
- Indices 0-2: 0%
- Index 3: 41.66% (5/12)
- Index 4: 16.66% (2/12)
- Index 5: 41.66% (5/12)
The ideal pick ratio is that indices 3-5 will all get 33.33%.
The pick ratios should approach the ideal as the number of possible increment values increases (coprimes with the host count). Here's what the possible increment values look like by host count:
- 1: 0
- 2: 1
- 3: 1, 2
- 4: 1, 3
- 5: 1, 2, 3, 4
- 6: 1, 5
- 7: 1, 2, 3, 4, 5, 6
- 8: 1, 3, 5, 7
- 9: 1, 2, 4, 5, 7, 8
- 10: 1, 3, 7, 9
If it's deemed critical that the selection match the ideal pick ratio in all cases, then this approach is not great. However, we could potentially have two different modes in that case... one that is FULL_SCAN, which follows the ideal pick ratio in all cases, and one that is FULL_SCAN_BIASED, with much lower memory overhead but with the trade-off that it doesn't always follow the ideal pick ratio (and how much it deviates depends on the host count, how many ties for "least requests" exist, and how those ties are distributed in the host array).
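As an aside, the increment lists above are just the values in [1, host_count) that are coprime with the host count; a small helper like the following reproduces them (illustrative only, not part of the PR):
#include <cstdint>
#include <numeric>  // std::gcd
#include <vector>

// Returns every increment that visits all hosts exactly once for a given host count.
std::vector<uint32_t> validIncrements(uint32_t host_count) {
  std::vector<uint32_t> increments;
  for (uint32_t k = 1; k < host_count; ++k) {
    if (std::gcd(k, host_count) == 1) {
      increments.push_back(k);
    }
  }
  return increments;
}

// validIncrements(6) yields {1, 5}; validIncrements(7) yields {1, 2, 3, 4, 5, 6}.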
Before the implementation, I think the first thing here is that we shouldn't change the API directly, even if it's marked as
cc @envoyproxy/api-shepherds
@wbpcode : Making sure I understand: I need to leave
Could you just implement this feature with the existing API? Is there any case that the bool cannot cover? 🤔
Yes. However, I was asked by @tonya11en to use an enum instead of a bool to define the selection method. Let me know what field type you and @tonya11en agree upon, and then I'll proceed.
If we're stuck with the
Just a thought. The enum approach is already implemented, so both are readily available approaches. I defer to you both, @wbpcode and @tonya11en.
@wbpcode if it's marked
@tonya11en Envoy is not the only client that implements xDS. And the API is the API: even if we regret it now, we shouldn't change it directly after the 2-week window. If we want more options in the future, deprecating it and adding a new field should be fine according to our API policy. But for now, I don't see the necessity.
I had also taken
A long time ago, I had a similar discussion with @markdroth about a field that was not implemented in Envoy but had been used in gRPC. Considering gRPC is the only well-known non-Envoy xDS client, if @markdroth agrees, it's possible to treat this field as a special case. But I think we still need a reasonable justification for this change. The only reason I can think of now is that we could add a new combined mode by extending the enum in the future, to enable full scan only when the host count is less than the choice count.
I'd be really surprised if someone already added support for that boolean in other data planes, like gRPC. In the meantime, can we just assume it's fine to remove but wait to merge until we're sure?
Yeah. Although we can be fairly sure this field hasn't been used, the certainty is not 100%, so the only way to keep the API stable is to not do that. This is why we always need to be careful with API changes.
Okay. We can review the API and code first. But my bottom line is to use a new field ID rather than changing the current field directly. And finally, we need an LGTM from @markdroth.
Reading between the lines of this conversation, it sounds like:
Is this accurate? If so, I'll proceed with making the changes from review, though I'd still benefit from a response to this comment thread.
I elaborated on my earlier comment about the selection probabilities being wonky. If you want to see the test I hacked together, it's here: tonya11en@b12317f
Let me know if anything still needs clarification. Thanks for being patient throughout all this.
thanks for working on this @jkirschner-hashicorp, docs review below
// used to select the one with least requests instead of using random choices.
google.protobuf.BoolValue enable_full_scan = 5;
// Method for selecting the host set from which to return the host with the fewest active requests.
// Defaults to POWER_OF_N_CHOICES, where N is determined by `choice_count` (2 by default).
// Defaults to POWER_OF_N_CHOICES, where N is determined by `choice_count` (2 by default).
// Defaults to ``POWER_OF_N_CHOICES``, where ``N`` is determined by ``choice_count`` (2 by default).
api/envoy/extensions/load_balancing_policies/least_request/v3/least_request.proto
changelogs/current.yaml
Outdated
- area: upstream
  change: |
    Fixed a reported issue (https://github.com/envoyproxy/envoy/issues/11004) that causes the Least
    Request load balancer policy to be unfair when the number of hosts are very small, when the number
Please break this sentence up; currently it's not grammatically correct and the meaning is hard to decipher.
This will be "fixed" by removing the entry entirely. Since we have decided to make full scan mode always require opt in (even if choice count > # hosts), this is no longer even an accurate/applicable entry.
changelogs/current.yaml
Outdated
  change: |
    Fixed a reported issue (https://github.com/envoyproxy/envoy/issues/11004) that causes the Least
    Request load balancer policy to be unfair when the number of hosts are very small, when the number
    of hosts is smaller than the choice_count, instead of randomly selection hosts from the list, we
of hosts is smaller than the choice_count, instead of randomly selection hosts from the list, we
of hosts is smaller than the ``choice_count``, instead of randomly selection hosts from the list, we
or better, use the link you have used below
changelogs/current.yaml
Outdated
- area: upstream
  change: |
    Added :ref:`selection_method <envoy_v3_api_msg_extensions.load_balancing_policies.least_request.v3.LeastRequest>`
    option to the least request load balancer. If set to FULL_SCAN, or if
option to the least request load balancer. If set to FULL_SCAN, or if
option to the Least Request load balancer. If set to ``FULL_SCAN``, or if
for sure - as posted above - both are definitely unrelated - just an ongoing battle to reduce/tackle flakiness
I went ahead and ran the test 10k times and it seems we're all good:
🎉
Signed-off-by: Jared Kirschner <jkirschner@hashicorp.com>
@tonya11en : Let me know if there are any more actions for me to take here. It looks like some CI steps failed, but they don't seem to be related to changes introduced by this PR. Thank you for your input, feedback, and patience throughout!
/retest @jkirschner-hashicorp there's nothing for you to do at this point. We just need a senior maintainer to review before the PR is eligible for a merge. Will rerun the tests in the meantime and cross fingers.
/retest
/lgtm api
/retest
Not sure what's up with the tests, but otherwise looks good.
There's now a merge conflict with the changelogs file. I suppose I can fix that, and hopefully that kicks off a new build that passes the Envoy/Windows stage.
Signed-off-by: Jared Kirschner <jkirschner@hashicorp.com>
After addressing the changelog merge conflict (and making no other changes), the Envoy/Windows stage now passes, but some other stages are now failing for reasons that seem unrelated to this PR.
/retest
@phlax do you know what is going on with the failing coverage test? It doesn't have anything to do with this PR.
I think this needs a main merge to fix coverage (and current.yaml)
Signed-off-by: Jared Kirschner <jkirschner@hashicorp.com>
Fingers crossed that the
Seems like the coverage check passed this time. However, one CI test flake occurred from the list of reported flakes (#32132):
/retest
Woo! All checks have passed!
Commit Message: Add FULL_SCAN mode to least request load balancer.
By default, the least request load balancer returns the host with the fewest
active requests from a set of N randomly selected hosts.
This introduces a new "full scan" selection method that returns the host with
the fewest number of active requests from all hosts. If multiple hosts are
tied for "least", one of the tied hosts is randomly chosen.
Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Added selection_method option to the least request load balancer. If set to FULL_SCAN, Envoy will select the host with the fewest active requests from the entire host set rather than choice_count random choices.
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]
This PR is a successor to #31146 and other efforts related to adding a full scan mode to the least request load balancer.
This approach preserves automatically using full scan mode when choice_count > number_of_hosts. That functionality was previously removed in #30794 because it caused the first host to always be selected if cluster stats were disabled. However, that justification no longer applies because the start index of the full scan is now randomized.
Many thanks to @barroca for initiating this effort and getting it most of the way to the finish line.
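For anyone wanting to try the new option from C++, a minimal sketch of building the config (the generated header path and enum accessor follow the usual protobuf conventions for the proto file touched in this PR; treat them as assumptions rather than verified against a release):
#include "envoy/extensions/load_balancing_policies/least_request/v3/least_request.pb.h"

using envoy::extensions::load_balancing_policies::least_request::v3::LeastRequest;

LeastRequest makeFullScanConfig() {
  LeastRequest config;
  // Opt in to scanning every host instead of sampling `choice_count` of them.
  config.set_selection_method(LeastRequest::FULL_SCAN);
  return config;
}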