Add FULL_SCAN selection mode to least request LB #31507

Conversation

@jkirschner-hashicorp (Contributor) commented Dec 23, 2023

Commit Message: Add FULL_SCAN mode to least request load balancer.

By default, the least request load balancer returns the host with the fewest
active requests from a set of N randomly selected hosts.

This introduces a new "full scan" selection method that returns the host with
the fewest number of active requests from all hosts. If multiple hosts are
tied for "least", one of the tied hosts is randomly chosen.

Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:

Added selection_method option to the least request load balancer. If set to FULL_SCAN, Envoy will select the host with the fewest active requests from the entire host set rather than choice_count random choices.

Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]


This PR is a successor to #31146 and other efforts related to adding a full scan mode to the least request load balancer.

This approach preserves the existing behavior of automatically using full scan mode when choice_count > number_of_hosts. That functionality was previously removed in #30794 because it caused the first host to always be selected if cluster stats were disabled. However, that justification no longer applies because the start index of the full scan is now randomized.

Many thanks to @barroca for initiating this effort and getting it most of the way to the finish line.


Hi @jkirschner-hashicorp, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.



CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @wbpcode
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).


@jkirschner-hashicorp
Contributor Author

@wbpcode , @tonya11en : I realize I'll need to resolve the changelogs conflict and the commit sign-off. This is otherwise ready for your feedback.

google.protobuf.BoolValue enable_full_scan = 5;
// Method for selecting the host set from which to return the host with the fewest active requests.
// Defaults to POWER_OF_N_CHOICES, where N is determined by `choice_count` (2 by default).
SelectionMethod selection_method = 5 [(validate.rules).enum = {defined_only: true}];
Contributor Author

Are there any issues with modifying the meaning of field 5 when it was never included in an Envoy release (AFAIK)?

Contributor

Yes, this is a problem. The xDS protos are not used exclusively for Envoy, and they do get picked up by other data planes (e.g., gRPC) in between Envoy releases.

Please leave the existing field as-is, although you can mark it as deprecated if you want. The new field should be added with a new number, and you can document that the old field is ignored if the new field is set.

// The default number of choices for the LEAST_REQUEST policy is 2 hosts, if the number of hosts
// is equal to the number of choices, a full scan happens instead, this means that the same host
// will be chosen.
if (is_hash_lb_ || (GetParam() == envoy::config::cluster::v3::Cluster::LEAST_REQUEST)) {
Contributor Author

Note to self: check whether this patch from the original enable full scan PR is actually still needed

Contributor Author

Removing. No longer applicable, since full scan mode will require opt-in in all cases

@adisuissa
Contributor

Please fix DCO.
Assigning Tony for a first-pass review.
/assign @tonya11en

Member

@tonya11en left a comment

Thanks for doing this!

Lots of comments, but only the comments about not changing existing LB behavior are important. The LB's host selection behavior should only change if the FULL_SCAN method is set. The original full scan PR had to be reverted because it did this (see #31146 (review)).

// Available methods for selecting the host set from which to return the host with the
// fewest active requests.
enum SelectionMethod {
POWER_OF_N_CHOICES = 0;
Member

I'd call this something like N_CHOICE. The "power of" phrasing just comes from the title of one of the academic papers discussing this.

// The number of random healthy hosts from which the host with the fewest active requests will
// be chosen. Defaults to 2 so that we perform two-choice selection if the field is not set.
// Ignored if ``selection_method == FULL_SCAN``.
Member

For future-proofing this comment, phrase it something like:
Only applies to the N_CHOICE selection method.

Just in case another selection method shows up.

message LeastRequest {
// Available methods for selecting the host set from which to return the host with the
// fewest active requests.
enum SelectionMethod {
Member

Can you add some 1-liners explaining what it means to set these? I don't think it's explained anywhere what a FULL_SCAN method means.

Contributor

Can you please add some guidance/recommendation on when full scan should be used?

Contributor Author

@barroca : Can you comment on your use case for full scan mode? I know it was different than mine. I'd like to mention both if applicable.

Contributor

I didn't have any use case for it :) I picked up the work since it looked simple enough to be a learning opportunity for contributing to the project.

The use case I was trying to address is when the number of hosts is very small: there is a real chance that the random host selection picks the same host repeatedly, or misses the host with the fewest requests, leaving the load unbalanced.

But this mostly matters when dealing with a small number of requests, because at high throughput the probability of repeating the same random selection decreases and requests balance out between hosts (according to the paper the original algorithm is based on).

The second implementation, instead of doing a full scan from the same index in the host list every time, randomised the start index to further reduce the chance of choosing the same index when the request counts are equal.

I guess the main use case for a full scan is to guarantee that the request is sent to the host with the fewest requests (which can be beneficial for some workloads, like evenly splitting work in a map-reduce).
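
A quick back-of-the-envelope illustration of that small-host-set effect (an approximation that treats the two choices as independent uniform draws, not the exact sampling Envoy performs):

// Rough illustration only: approximates two-choice sampling as two independent
// uniform draws and prints how often the unique least-loaded host is missed
// entirely for small host counts (~44% with only 3 hosts).
#include <cstdio>

int main() {
  for (int n = 2; n <= 8; ++n) {
    const double miss = ((n - 1.0) / n) * ((n - 1.0) / n);
    std::printf("hosts=%d  P(two-choice misses the least-loaded host)=%.2f\n", n, miss);
  }
  return 0;
}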

Contributor Author

Just added a line that explains N_CHOICES is best for most scenarios, and also explained the niche scenarios in which FULL_SCAN may be preferable. Let me know what you think, @tonya11en and @ramaraochavali. (I think this is the last outstanding review comment)

Contributor

thank you

// used to select the one with least requests instead of using random choices.
google.protobuf.BoolValue enable_full_scan = 5;
// Method for selecting the host set from which to return the host with the fewest active requests.
// Defaults to POWER_OF_N_CHOICES, where N is determined by `choice_count` (2 by default).
Member

Let's not say what choice_count defaults to, since it's included in that field. If the default changes, someone would have to change this comment too.

I'd also move the explanation of what the POWER_OF_N_CHOICES means to the enum definition.

Comment on lines 76 to 81
- area: upstream
change: |
Fixed a reported issue (https://github.com/envoyproxy/envoy/issues/11004) that causes the Least
Request load balancer policy to be unfair when the number of hosts are very small, when the number
of hosts is smaller than the choice_count, instead of randomly selection hosts from the list, we
perform a full scan on it to choose the host with least requests.
Member

I don't think this is supposed to be included?

source/common/upstream/load_balancer_impl.h (comment resolved)
Comment on lines 1306 to 1311
envoy::extensions::load_balancing_policies::least_request::v3::LeastRequest_SelectionMethod_FULL_SCAN;
if (has_full_scan_configured || (hosts_to_use.size() <= choice_count_)) {
// Choose a random index to start from preventing always picking the first host in the list.
const int rand_idx = random_.random() % hosts_to_use.size();
for (unsigned long i = 0; i < hosts_to_use.size(); i++) {
const HostSharedPtr& sampled_host = hosts_to_use[(rand_idx + i) % hosts_to_use.size()];
Member

We don't want to change the default behavior of the LB if FULL_SCAN is not set, so I'd remove the part that checks hosts_to_use.size() <= choice_count_. Otherwise, this will break things for a lot of folks that depend on the subtle differences in selection probabilities when using the n-choice selection.

I think I go into the details in #31146 (review).

// - the number of choices is equal to or larger than the number of hosts.
const auto has_full_scan_configured = selection_method_ ==
envoy::extensions::load_balancing_policies::least_request::v3::LeastRequest_SelectionMethod_FULL_SCAN;
if (has_full_scan_configured || (hosts_to_use.size() <= choice_count_)) {
Member

Can you pull this block out into its own function? It would probably look nice if you had a switch statement that calls the appropriate selection function.
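
For example, a dispatch along these lines (sketch only; the helper function names are hypothetical, and the enum value spellings follow the proto naming that was still under discussion above):

// Sketch of the suggested refactor: one selection function per method.
HostConstSharedPtr LeastRequestLoadBalancer::unweightedHostPick(const HostVector& hosts_to_use) {
  switch (selection_method_) {
  case envoy::extensions::load_balancing_policies::least_request::v3::
      LeastRequest_SelectionMethod_FULL_SCAN:
    return unweightedHostPickFullScan(hosts_to_use);
  case envoy::extensions::load_balancing_policies::least_request::v3::
      LeastRequest_SelectionMethod_N_CHOICES:
  default:
    return unweightedHostPickNChoices(hosts_to_use);
  }
}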

Comment on lines 2787 to 2799

// Host weight is 1.
{
EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2)).WillOnce(Return(3));
EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2));
EXPECT_EQ(hostSet().healthy_hosts_[0], lb_.chooseHost(nullptr));
}

// Host weight is 100.
{
EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2)).WillOnce(Return(3));
EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2));
EXPECT_EQ(hostSet().healthy_hosts_[0], lb_.chooseHost(nullptr));
}

Member

Based on my comments about only changing behavior when FULL_SCAN is set, I don't think you need to modify any existing unit tests. All that should be needed is a new one that sets the new selection method and verifies sane behavior.

You might even be able to get away with unit testing just the selection method if it's in its own function.

@tonya11en
Member

To fix the DCO check, you'll need all your commits to be made via git commit -s so you have the "signed off by..." at the bottom. You can just squash them all together, amend the commit, and force push it to your branch. That should do the trick.

Comment on lines 1308 to 1309
// Choose a random index to start from preventing always picking the first host in the list.
const int rand_idx = random_.random() % hosts_to_use.size();
Member

This is still going to be problematic as far as selection probabilities go. Consider some host vector with the following weights:

[9, 9, 1, 1, 1]

Choosing a random index to start the scan from would still choose the first host 80% of the time. The host at index 2 will only be picked 20% of the time, which seems unintuitive.

Contributor Author

Can you walk me through why that happens?

I thought hosts_to_use.size() would be 5, thinking it was just a std::vector. Your comment makes me think something else is happening, like hosts_to_use.size() returns a sum of weights? But presumably the array is still of size 5, so I shouldn't be able to index past [4] in the for-loop that iterates through hosts. I feel like I'm missing something.

Member

Sorry, the issue is with the active request counts, not weight. The problem is the same though; see below.

So, suppose you have active request counts [9, 9, 9, 1, 1]. When picking a host, you're choosing a random index and full scanning through all 5 hosts. This implementation is keeping the host with the lowest active request count, so if there's a tie between hosts it will still choose the first one it finds. In this example, the only time you'd choose the host at index 4 is if your random index lands on it first. Otherwise, you'd choose the host at index 3, since 80% of the time it will be the first host with the lowest active rq count. Does that make sense?

I applied your patch and hacked together a simulation to show what I mean. For the example above, here are results:

n=1000000
: weight=1 active_rqs=1, pick_ratio=0.199839
: weight=1 active_rqs=1, pick_ratio=0.800161
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0

I ran this for a few different combinations of active request counts and the results are consistent with my explanation above. Dig it:

[ RUN      ] LeastRequestLoadBalancerWeightTest.TallenHax
================================
n=1000000
: weight=1 active_rqs=1, pick_ratio=0.200265
: weight=1 active_rqs=1, pick_ratio=0.200148
: weight=1 active_rqs=1, pick_ratio=0.199813
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=1, pick_ratio=0.399774
================================
n=1000000
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=1, pick_ratio=0.199856
: weight=1 active_rqs=1, pick_ratio=0.600027
: weight=1 active_rqs=1, pick_ratio=0.200117
: weight=1 active_rqs=9, pick_ratio=0
================================
n=1000000
: weight=1 active_rqs=1, pick_ratio=0.199839
: weight=1 active_rqs=1, pick_ratio=0.800161
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
================================
n=1000000
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=1, pick_ratio=1
: weight=1 active_rqs=9, pick_ratio=0
================================
n=1000000
: weight=1 active_rqs=1, pick_ratio=0.200368
: weight=1 active_rqs=1, pick_ratio=0.199323
: weight=1 active_rqs=1, pick_ratio=0.200177
: weight=1 active_rqs=1, pick_ratio=0.199968
: weight=1 active_rqs=1, pick_ratio=0.200164

It can be easily fixed, but it's gonna make the implementation a bit less performant.

Contributor Author

Ah, that makes sense. If there's a tie for "least requests", then hosts with more non-least requests before them in the array will get more selection weight.

It can be easily fixed, but it's gonna make the implementation a bit less performant.

I haven't thought about the best way to easily fix this yet. If you have a suggestion in mind, I'm open to it.

Member

Here's some pseudo-code:

std::vector<HostPtr> candidates;
candidates.reserve(hosts_to_use.size());
for (const auto& host : hosts_to_use) {
  const bool reset_candidates =
      candidates.empty() || (host->active_rqs < candidates.front()->active_rqs);
  if (reset_candidates) {
    // Found a new minimum active request count.
    candidates.clear();
    candidates.push_back(host);
  } else if (host->active_rqs == candidates.front()->active_rqs) {
    // Found another host with the minimum number of active requests.
    candidates.push_back(host);
  }
}

// rand_idx is any fresh random number.
return candidates[rand_idx % candidates.size()];

Basically just find all the hosts that share the smallest number of active requests and choose one of them at random. There's probably some fancy modern std::algorithm function that does this in a more elegant way, but you get the idea.
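
For what it's worth, a constant-memory variant of the same idea (my sketch, not code from this PR) is to keep a single candidate and, when the k-th host tied for the minimum is seen, replace the candidate with probability 1/k; that still picks uniformly among the tied hosts without building the candidates vector:

// Drop-in for the same spot as the pseudo-code above. Reading the active
// request count via host->stats().rq_active_.value() mirrors how the least
// request LB reads it elsewhere; treat the exact accessor as an assumption.
HostSharedPtr chosen;
uint64_t min_rqs = std::numeric_limits<uint64_t>::max();
uint64_t ties = 0;
for (const HostSharedPtr& host : hosts_to_use) {
  const uint64_t rqs = host->stats().rq_active_.value();
  if (rqs < min_rqs) {
    // New minimum: reset the tie counter.
    min_rqs = rqs;
    chosen = host;
    ties = 1;
  } else if (rqs == min_rqs) {
    // k-th tie: keep it with probability 1/k (reservoir sampling).
    ++ties;
    if (random_.random() % ties == 0) {
      chosen = host;
    }
  }
}
return chosen;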

Contributor Author

Actually, I have to change the random start index to be decoupled from the random number used to select the prime. I'll ping you again here when that's done.

Contributor Author

Done - decoupled the random number used to select a large prime from the random number used to select the pre-increment array index.

Member

Can't give this the attention it deserves today, but I'll take a look and get back to you by Tuesday (because of the holiday).

Contributor Author

That's totally fine - thanks for expectation setting :) Enjoy the time away! We'll re-engage sometime next week.

Contributor Author

Keep in mind, this approach will always perform somewhere between the "random start index, iterate by 1" approach and the pseudo-code you wrote above in terms of following the pick ratios. In most cases, I think it will get close enough to the ideal that your pseudo-code achieves. However, there will be some cases where it looks closer to the "random start index, iterate by 1" approach.

The worst case that comes to mind is with 6 hosts that have active request counts of [9, 9, 9, 1, 1, 1]. 6 hosts only has two possible increment values (1 and 5 are the only coprimes with 6 that are <6). 6 is the highest count with only 2 possible increment values (that happen to be +1 and -1 increment). In that case, the pick ratios will be:

  • Indices 0-2: 0%
  • Index 3: 41.66% (5/12)
  • Index 4: 16.66% (2/12)
  • Index 5: 41.66% (5/12)

The ideal pick ratio is that indices 3-5 will all get 33.33%.

The pick ratios should approach the ideal as the number of possible increment values increases (coprimes with the host count). Here's what the possible increment values look like by host count:

  • 1: 0
  • 2: 1
  • 3: 1, 2
  • 4: 1, 3
  • 5: 1, 2, 3, 4
  • 6: 1, 5
  • 7: 1, 2, 3, 4, 5, 6
  • 8: 1, 3, 5, 7
  • 9: 1, 2, 4, 5, 7, 8
  • 10: 1, 3, 7, 9

If it's deemed critical that the selection match the ideal pick ratio in all cases, then this approach is not great. However, we could potentially have two different modes in that case... one FULL_SCAN mode that follows the ideal pick ratio in all cases, and one FULL_SCAN_BIASED mode with much lower memory overhead, with the trade-off that it doesn't always follow the ideal pick ratio (how much it deviates depends on the host count, how many ties for "least requests" exist, and how those ties are distributed in the host array).
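
For reference, the coprime-stride scan being described looks roughly like this (my sketch of the idea, not the code in this PR; pickCoprimeStride and activeRequests are hypothetical stand-ins for whatever the real implementation uses):

// Start at a random index and step by a stride coprime with the host count so
// that every host is visited exactly once. The strict '<' keeps the first host
// visited on a tie, which is the source of the bias discussed above.
const uint64_t n = hosts_to_use.size();
const uint64_t start = random_.random() % n;
const uint64_t stride = pickCoprimeStride(n, random_.random()); // hypothetical helper
HostSharedPtr best;
uint64_t best_rqs = std::numeric_limits<uint64_t>::max();
for (uint64_t i = 0; i < n; ++i) {
  const HostSharedPtr& candidate = hosts_to_use[(start + i * stride) % n];
  const uint64_t rqs = activeRequests(candidate); // hypothetical accessor
  if (rqs < best_rqs) {
    best = candidate;
    best_rqs = rqs;
  }
}
return best;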

@wbpcode
Member

wbpcode commented Jan 3, 2024

Before the implementation, I think the first thing here is that we shouldn't change the API directly even if it's marked as not-implemented. The API was introduced in #29873 and we have missed the window in which we could change it.

cc @envoyproxy/api-shepherds

@jkirschner-hashicorp
Contributor Author

@wbpcode : Making sure I understand: I need to leave enable_full_scan as the not-implemented field 5 of the protobuf, making selection_method a new field 6?

@wbpcode
Member

wbpcode commented Jan 3, 2024

Could you just implement this feature with the existing API? Is there any case that the bool cannot cover? 🤔

@jkirschner-hashicorp
Contributor Author

Could you just implement this feature with the existing API?

Yes. However, I was asked by @tonya11en to use an enum instead of a bool to define the selection method.

Let me know what field type you and @tonya11en agree upon, and then I'll proceed.

@jkirschner-hashicorp
Contributor Author

If we're stuck with the enable_full_scan field anyways (because we can't remove it), then perhaps it would be best to just use that field? And if a 3rd selection method is added in the future, the enable_full_scan field can be deprecated at that time (and an enum can be added)?

Just a thought. The enum approach is already implemented, so both are readily available approaches. I defer to you both, @wbpcode and @tonya11en.

@tonya11en
Member

@wbpcode if it's marked [#not-implemented-hide:] how would anything be affected by its removal? The docs have excluded any mention of the field, so there should be no valid uses in the wild.

@wbpcode
Member

wbpcode commented Jan 4, 2024

@tonya11en Envoy is not the only client that implements xDS. And the API is the API: even if we regret it now, we shouldn't change it directly after the 2-week window.

If we want more options in the future, deprecating it and adding a new field should be fine according to our API policy. But for now, I don't see the necessity.

@htuch
Member

htuch commented Jan 4, 2024

I had also taken [#not-implemented-hide:] to mean WiP, and just checked what our style / version guides say. It's mentioned in https://github.com/envoyproxy/envoy/blob/main/api/STYLE.md as being Envoy-specific, but let's check with @markdroth on this one. We should clarify in the API style guides the semantics of this annotation in a multi-dataplane world.

@wbpcode
Member

wbpcode commented Jan 4, 2024

A long time ago, I had a similar discussion with @markdroth about a field that was not implemented in Envoy but had been used in gRPC.

Considering gRPC is the only well-known non-Envoy xDS client, if @markdroth agrees, it's possible to treat this field as a special case.

But I think we still need a reasonable justification for this change.

The only reason I can think of now is that we could add a new combined mode in the future, by extending the enum, that enables full scan only when the host count is less than the choice count.
Not sure if this is an appropriate reason. 🤔

@tonya11en
Member

I'd be really surprised if someone already added support for that boolean in other data planes, like gRPC. In the meantime, can we just assume it's fine to remove but wait to merge until we're sure?

@wbpcode
Member

wbpcode commented Jan 5, 2024

I'd be really surprised if someone already added support for that boolean in other data planes, like gRPC.

Yeah. Basically we can be fairly sure this field hasn't been used, but it's not 100% certain, so the only way to keep the API stable is to not make that change. This is why we always need to be careful with API changes.

In the meantime, can we just assume it's fine to remove but wait to merge until we're sure?

Okay. We can review the API and code first. But my bottom line is to use a new field ID rather than changing the current field directly. And finally, we need an LGTM from @markdroth.

@jkirschner-hashicorp
Contributor Author

Reading between the lines of this conversation, it sounds like:

  • We want to use an enum rather than a Bool field
  • The enum field should be a new field, leaving the previous bool field as "not implemented"

Is this accurate? If so, I'll proceed with making the changes from review, though I'd still benefit from a response to this comment thread.

Member

@tonya11en left a comment

I elaborated on my earlier comment about the selection probabilities being wonky. If you want to see the test I hacked together, it's here: tonya11en@b12317f

Let me know if anything still needs clarification. Thanks for being patient throughout all this.


@wbpcode wbpcode assigned markdroth and unassigned wbpcode Jan 8, 2024
Member

@phlax left a comment

thanks for working on this @jkirschner-hashicorp, docs review below

// used to select the one with least requests instead of using random choices.
google.protobuf.BoolValue enable_full_scan = 5;
// Method for selecting the host set from which to return the host with the fewest active requests.
// Defaults to POWER_OF_N_CHOICES, where N is determined by `choice_count` (2 by default).
Member

Suggested change
// Defaults to POWER_OF_N_CHOICES, where N is determined by `choice_count` (2 by default).
// Defaults to ``POWER_OF_N_CHOICES``, where ``N`` is determined by ``choice_count`` (2 by default).

- area: upstream
change: |
Fixed a reported issue (https://github.com/envoyproxy/envoy/issues/11004) that causes the Least
Request load balancer policy to be unfair when the number of hosts are very small, when the number
Member

Please break this sentence up; currently it's not grammatically correct and the meaning is hard to decipher.

Contributor Author

This will be "fixed" by removing the entry entirely. Since we have decided to make full scan mode always require opt in (even if choice count > # hosts), this is no longer even an accurate/applicable entry.

change: |
Fixed a reported issue (https://github.com/envoyproxy/envoy/issues/11004) that causes the Least
Request load balancer policy to be unfair when the number of hosts are very small, when the number
of hosts is smaller than the choice_count, instead of randomly selection hosts from the list, we
Member

Suggested change
of hosts is smaller than the choice_count, instead of randomly selection hosts from the list, we
of hosts is smaller than the ``choice_count``, instead of randomly selection hosts from the list, we

Member

or better, use the link you have used below

- area: upstream
change: |
Added :ref:`selection_method <envoy_v3_api_msg_extensions.load_balancing_policies.least_request.v3.LeastRequest>`
option to the least request load balancer. If set to FULL_SCAN, or if
Member

Suggested change
option to the least request load balancer. If set to FULL_SCAN, or if
option to the Least Request load balancer. If set to ``FULL_SCAN``, or if

@phlax
Member

phlax commented Jan 25, 2024

for sure - as posted above - both are definitely unrelated - just an ongoing battle to reduce/tackle flakiness

@tonya11en
Member

I went ahead and ran the test 10k times and it seems we're all good:

 txallen@stormbeard  ~/src/envoy   jkirschner-hashicorp-least_request_lb_enable_full_scan_mode  bazel test --cache_test_results=no //test/common/upstream:load_balancer_impl_test --test_arg=--gtest_filter=LeastRequestLoadBalancerTest.FullScanMultipleHostsWithLeastRequests --runs_per_test=10000 --local_test_jobs=1
INFO: Invocation ID: 4477bf37-d686-4220-a0a4-32f870c78a9c
INFO: Analyzed target //test/common/upstream:load_balancer_impl_test (0 packages loaded, 532 targets configured).
INFO: Found 1 test target...
Target //test/common/upstream:load_balancer_impl_test up-to-date:
  bazel-bin/test/common/upstream/load_balancer_impl_test
INFO: Elapsed time: 44.994s, Critical Path: 3.31s
INFO: 10001 processes: 1 internal, 10000 remote.
INFO: Build completed successfully, 10001 total actions

Executed 1 out of 1 test: 1 test passes.

🎉

Signed-off-by: Jared Kirschner <jkirschner@hashicorp.com>
@jkirschner-hashicorp
Contributor Author

@tonya11en : Let me know if there are any more actions for me to take here. It looks like some CI steps failed, but don't seem to be related to changes introduced by this PR. Thank you for your input, feedback, and patience throughout!

@tonya11en
Member

/retest

@jkirschner-hashicorp there's nothing for you to do at this point. We just need a senior maintainer to review before the PR is eligible for a merge.

Will rerun the tests in the meantime and cross fingers.

@tonya11en
Member

/retest

@markdroth
Contributor

/lgtm api

@zuercher
Member

/retest

zuercher previously approved these changes Jan 30, 2024
Member

@zuercher left a comment

Not sure what's up with the tests, but otherwise looks good.

@jkirschner-hashicorp
Contributor Author

There's now a merge conflict with the changelogs file. I suppose I can fix that, and hopefully that kicks off a new build that passes the Envoy/Windows stage.

Signed-off-by: Jared Kirschner <jkirschner@hashicorp.com>
@jkirschner-hashicorp
Contributor Author

After addressing the changelog merge conflict (and making no other changes), the Envoy/Windows stage now passes, but some other stages are now failing for reasons that seem unrelated to this PR.

@tonya11en
Member

/retest

@jkirschner-hashicorp
Contributor Author

It looks like the only failure on this run was related to code coverage. But I'm not sure what this PR has to do with source/common/io. Any ideas?


@tonya11en
Member

@phlax do you know what is going on with the failing coverage test? It doesn't have anything to do with this PR.

@alyssawilk
Contributor

I think this needs a main merge to fix coverage (and current.yaml)
/wait

Signed-off-by: Jared Kirschner <jkirschner@hashicorp.com>
@jkirschner-hashicorp
Contributor Author

Fingers crossed that the main merge fixes the code coverage strangeness.

@jkirschner-hashicorp
Contributor Author

Seems like the coverage check passed this time.

However, one CI test flake occurred from the list of reported flakes (#32132):


@tonya11en
Member

/retest

@jkirschner-hashicorp
Contributor Author

Woo! All checks have passed!

@zuercher zuercher merged commit 1995d92 into envoyproxy:main Feb 9, 2024
55 checks passed