Add FULL_SCAN selection mode to least request LB #31507

Conversation

@jkirschner-hashicorp (Contributor) commented Dec 23, 2023

Commit Message: Add FULL_SCAN mode to least request load balancer.

By default, the least request load balancer returns the host with the fewest
active requests from a set of N randomly selected hosts.

This introduces a new "full scan" selection method that returns the host with
the fewest number of active requests from all hosts. If multiple hosts are
tied for "least", one of the tied hosts is randomly chosen.

Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:

Added selection_method option to the least request load balancer. If set to FULL_SCAN, Envoy will select the host with the fewest active requests from the entire host set rather than choice_count random choices.

Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]


This PR is a successor to #31146 and other efforts related to adding a full scan mode to the least request load balancer.

This approach preserves the existing behavior of automatically using full scan mode when choice_count > number_of_hosts. That functionality was previously removed in #30794 because it caused the first host to always be selected if cluster stats were disabled. However, that justification no longer applies because the start index of the full scan is now randomized.

Many thanks to @barroca for initiating this effort and getting it most of the way to the finish line.


Hi @jkirschner-hashicorp, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.



CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @wbpcode
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).


@jkirschner-hashicorp
Contributor Author

@wbpcode , @tonya11en : I realize I'll need to resolve the changelogs conflict and the commit sign-off. This is otherwise ready for your feedback.

google.protobuf.BoolValue enable_full_scan = 5;
// Method for selecting the host set from which to return the host with the fewest active requests.
// Defaults to POWER_OF_N_CHOICES, where N is determined by `choice_count` (2 by default).
SelectionMethod selection_method = 5 [(validate.rules).enum = {defined_only: true}];
Contributor Author

Are there any issues with modifying the meaning of field 5 when it was never included in an Envoy release (AFAIK)?

Contributor

Yes, this is a problem. The xDS protos are not used exclusively for Envoy, and they do get picked up by other data planes (e.g., gRPC) in between Envoy releases.

Please leave the existing field as-is, although you can mark it as deprecated if you want. The new field should be added with a new number, and you can document that the old field is ignored if the new field is set.

// The default number of choices for the LEAST_REQUEST policy is 2 hosts, if the number of hosts
// is equal to the number of choices, a full scan happens instead, this means that the same host
// will be chosen.
if (is_hash_lb_ || (GetParam() == envoy::config::cluster::v3::Cluster::LEAST_REQUEST)) {
Contributor Author

Note to self: check whether this patch from the original enable full scan PR is actually still needed

Contributor Author

Removing. No longer applicable, since full scan mode will require opt-in in all cases

@adisuissa
Contributor

Please fix DCO.
Assigning Tony for a first-pass review.
/assign @tonya11en

Member

@tonya11en left a comment

Thanks for doing this!

Lots of comments, but only the comments about not changing existing LB behavior are important. The LB's host selection behavior should only change if the FULL_SCAN method is set. The original full scan PR had to be reverted because it did this (see #31146 (review)).

// Available methods for selecting the host set from which to return the host with the
// fewest active requests.
enum SelectionMethod {
POWER_OF_N_CHOICES = 0;
Member

I'd call this something like N_CHOICE. The "power of" phrasing just comes from the title of one of the academic papers discussing this.

// The number of random healthy hosts from which the host with the fewest active requests will
// be chosen. Defaults to 2 so that we perform two-choice selection if the field is not set.
// Ignored if ``selection_method == FULL_SCAN``.
Member

For future-proofing this comment, phrase it something like:
Only applies to the N_CHOICE selection method.

Just in case another selection method shows up.

message LeastRequest {
// Available methods for selecting the host set from which to return the host with the
// fewest active requests.
enum SelectionMethod {
Member

Can you add some 1-liners explaining what it means to set these? I don't think it's explained anywhere what a FULL_SCAN method means.

Contributor

Can you please add some guidance/recommendation on when full scan should be used?

Contributor Author

@barroca : Can you comment on your use case for full scan mode? I know it was different than mine. I'd like to mention both if applicable.

Contributor

I didn't have any use case for it :) I picked up the work since it looked simple enough to be a learning opportunity for contributing to the project.

The use case I was trying to address is when the number of hosts is very small: there is a real chance that the random host selection picks the same host repeatedly, or misses the host with the fewest requests, leaving the load unbalanced.

But this mostly matters when dealing with a small number of requests, because at high throughput the probability of repeating the same random selection decreases and requests balance out between hosts (according to the paper the original algorithm is based on).

The second implementation, instead of doing a full scan from the same index in the host list every time, randomised the start index to further reduce the chance of choosing the same index when the request counts are equal.

I guess the main use case for a full scan is to guarantee that the request is sent to the host with the fewest requests (which can be beneficial for some workloads, like evenly splitting work in a map-reduce).
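
A quick back-of-the-envelope illustration of that small-host-set effect (an approximation that treats the two choices as independent uniform draws, not the exact sampling Envoy performs):

// Rough illustration only: approximates two-choice sampling as two independent
// uniform draws and prints how often the unique least-loaded host is missed
// entirely for small host counts (~44% with only 3 hosts).
#include <cstdio>

int main() {
  for (int n = 2; n <= 8; ++n) {
    const double miss = ((n - 1.0) / n) * ((n - 1.0) / n);
    std::printf("hosts=%d  P(two-choice misses the least-loaded host)=%.2f\n", n, miss);
  }
  return 0;
}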

Contributor Author

Just added a line that explains N_CHOICES is best for most scenarios, and also explained the niche scenarios in which FULL_SCAN may be preferable. Let me know what you think, @tonya11en and @ramaraochavali. (I think this is the last outstanding review comment)

Contributor

thank you

// used to select the one with least requests instead of using random choices.
google.protobuf.BoolValue enable_full_scan = 5;
// Method for selecting the host set from which to return the host with the fewest active requests.
// Defaults to POWER_OF_N_CHOICES, where N is determined by `choice_count` (2 by default).
Member

Let's not say what choice_count defaults to, since it's included in that field. If the default changes, someone would have to change this comment too.

I'd also move the explanation of what the POWER_OF_N_CHOICES means to the enum definition.

Comment on lines 76 to 81
- area: upstream
change: |
Fixed a reported issue (https://github.com/envoyproxy/envoy/issues/11004) that causes the Least
Request load balancer policy to be unfair when the number of hosts are very small, when the number
of hosts is smaller than the choice_count, instead of randomly selection hosts from the list, we
perform a full scan on it to choose the host with least requests.
Member

I don't think this is supposed to be included?

source/common/upstream/load_balancer_impl.h (comment resolved)
Comment on lines 1306 to 1311
envoy::extensions::load_balancing_policies::least_request::v3::LeastRequest_SelectionMethod_FULL_SCAN;
if (has_full_scan_configured || (hosts_to_use.size() <= choice_count_)) {
// Choose a random index to start from preventing always picking the first host in the list.
const int rand_idx = random_.random() % hosts_to_use.size();
for (unsigned long i = 0; i < hosts_to_use.size(); i++) {
const HostSharedPtr& sampled_host = hosts_to_use[(rand_idx + i) % hosts_to_use.size()];
Member

We don't want to change the default behavior of the LB if FULL_SCAN is not set, so I'd remove the part that checks hosts_to_use.size() <= choice_count_. Otherwise, this will break things for a lot of folks that depend on the subtle differences in selection probabilities when using the n-choice selection.

I think I go into the details in #31146 (review).

// - the number of choices is equal to or larger than the number of hosts.
const auto has_full_scan_configured = selection_method_ ==
envoy::extensions::load_balancing_policies::least_request::v3::LeastRequest_SelectionMethod_FULL_SCAN;
if (has_full_scan_configured || (hosts_to_use.size() <= choice_count_)) {
Member

Can you pull this block out into its own function? It would probably look nice if you had a switch statement that calls the appropriate selection function.
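
For example, a dispatch along these lines (sketch only; the helper function names are hypothetical, and the enum value spellings follow the proto naming that was still under discussion above):

// Sketch of the suggested refactor: one selection function per method.
HostConstSharedPtr LeastRequestLoadBalancer::unweightedHostPick(const HostVector& hosts_to_use) {
  switch (selection_method_) {
  case envoy::extensions::load_balancing_policies::least_request::v3::
      LeastRequest_SelectionMethod_FULL_SCAN:
    return unweightedHostPickFullScan(hosts_to_use);
  case envoy::extensions::load_balancing_policies::least_request::v3::
      LeastRequest_SelectionMethod_N_CHOICES:
  default:
    return unweightedHostPickNChoices(hosts_to_use);
  }
}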

Comment on lines 2787 to 2799

// Host weight is 1.
{
EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2)).WillOnce(Return(3));
EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2));
EXPECT_EQ(hostSet().healthy_hosts_[0], lb_.chooseHost(nullptr));
}

// Host weight is 100.
{
EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2)).WillOnce(Return(3));
EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2));
EXPECT_EQ(hostSet().healthy_hosts_[0], lb_.chooseHost(nullptr));
}

Member

Based on my comments about only changing behavior when FULL_SCAN is set, I don't think you need to modify any existing unit tests. All that should be needed is a new one that sets the new selection method and verifies sane behavior.

You might even be able to get away with unit testing just the selection method if it's in its own function.

@tonya11en
Member

To fix the DCO check, you'll need all your commits to be made via git commit -s so you have the "signed off by..." at the bottom. You can just squash them all together, amend the commit, and force push it to your branch. That should do the trick.

Comment on lines 1308 to 1309
// Choose a random index to start from preventing always picking the first host in the list.
const int rand_idx = random_.random() % hosts_to_use.size();
Member

This is still going to be problematic as far as selection probabilities go. Consider some host vector with the following weights:

[9, 9, 1, 1, 1]

Choosing a random index to start the scan from would still choose the first host 80% of the time. The host at index 2 will only be picked 20% of the time, which seems unintuitive.

Contributor Author

Can you walk me through why that happens?

I thought hosts_to_use.size() would be 5, thinking it was just a std::vector. Your comment makes me think something else is happening, like hosts_to_use.size() returns a sum of weights? But presumably the array is still of size 5, so I shouldn't be able to index past [4] in the for-loop that iterates through hosts. I feel like I'm missing something.

Member

Sorry, the issue is with the active request counts, not weight. The problem is the same though; see below.

So, suppose you have active request counts [9, 9, 9, 1, 1]. When picking a host, you're choosing a random index and full scanning through all 5 hosts. This implementation is keeping the host with the lowest active request count, so if there's a tie between hosts it will still choose the first one it finds. In this example, the only time you'd choose the host at index 4 is if your random index lands on it first. Otherwise, you'd choose the host at index 3, since 80% of the time it will be the first host with the lowest active rq count. Does that make sense?

I applied your patch and hacked together a simulation to show what I mean. For the example above, here are results:

n=1000000
: weight=1 active_rqs=1, pick_ratio=0.199839
: weight=1 active_rqs=1, pick_ratio=0.800161
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0

I ran this for a few different combinations of active request counts and the results are consistent with my explanation above. Dig it:

[ RUN      ] LeastRequestLoadBalancerWeightTest.TallenHax
================================
n=1000000
: weight=1 active_rqs=1, pick_ratio=0.200265
: weight=1 active_rqs=1, pick_ratio=0.200148
: weight=1 active_rqs=1, pick_ratio=0.199813
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=1, pick_ratio=0.399774
================================
n=1000000
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=1, pick_ratio=0.199856
: weight=1 active_rqs=1, pick_ratio=0.600027
: weight=1 active_rqs=1, pick_ratio=0.200117
: weight=1 active_rqs=9, pick_ratio=0
================================
n=1000000
: weight=1 active_rqs=1, pick_ratio=0.199839
: weight=1 active_rqs=1, pick_ratio=0.800161
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
================================
n=1000000
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=9, pick_ratio=0
: weight=1 active_rqs=1, pick_ratio=1
: weight=1 active_rqs=9, pick_ratio=0
================================
n=1000000
: weight=1 active_rqs=1, pick_ratio=0.200368
: weight=1 active_rqs=1, pick_ratio=0.199323
: weight=1 active_rqs=1, pick_ratio=0.200177
: weight=1 active_rqs=1, pick_ratio=0.199968
: weight=1 active_rqs=1, pick_ratio=0.200164

It can be easily fixed, but it's gonna make the implementation a bit less performant.

Contributor Author

Ah, that makes sense. If there's a tie for "least requests", then hosts with more non-least requests before them in the array will get more selection weight.

It can be easily fixed, but it's gonna make the implementation a bit less performant.

I haven't thought about the best way to easily fix this yet. If you have a suggestion in mind, I'm open to it.

Member

Here's some pseudo-code:

std::vector<HostPtr> candidates;
candidates.reserve(hosts_to_use.size());
for (const auto& host : hosts_to_use) {
  const bool reset_candidates =
      candidates.empty() || (host->active_rqs < candidates.front()->active_rqs);
  if (reset_candidates) {
    // Found a new minimum active request count.
    candidates.clear();
    candidates.push_back(host);
  } else if (host->active_rqs == candidates.front()->active_rqs) {
    // Found another host with the minimum number of active requests.
    candidates.push_back(host);
  }
}

// rand_idx is any fresh random number.
return candidates[rand_idx % candidates.size()];

Basically just find all the hosts that share the smallest number of active requests and choose one of them at random. There's probably some fancy modern std::algorithm function that does this in a more elegant way, but you get the idea.
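
For what it's worth, a constant-memory variant of the same idea (my sketch, not code from this PR) is to keep a single candidate and, when the k-th host tied for the minimum is seen, replace the candidate with probability 1/k; that still picks uniformly among the tied hosts without building the candidates vector:

// Drop-in for the same spot as the pseudo-code above. Reading the active
// request count via host->stats().rq_active_.value() mirrors how the least
// request LB reads it elsewhere; treat the exact accessor as an assumption.
HostSharedPtr chosen;
uint64_t min_rqs = std::numeric_limits<uint64_t>::max();
uint64_t ties = 0;
for (const HostSharedPtr& host : hosts_to_use) {
  const uint64_t rqs = host->stats().rq_active_.value();
  if (rqs < min_rqs) {
    // New minimum: reset the tie counter.
    min_rqs = rqs;
    chosen = host;
    ties = 1;
  } else if (rqs == min_rqs) {
    // k-th tie: keep it with probability 1/k (reservoir sampling).
    ++ties;
    if (random_.random() % ties == 0) {
      chosen = host;
    }
  }
}
return chosen;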

Contributor Author

Actually, I have to change the random start index to be decoupled from the random number used to select the prime. I'll ping you again here when that's done.

Contributor Author

Done - decoupled the random number used to select a large prime from the random number used to select the pre-increment array index.

Member

Can't give this the attention it deserves today, but I'll take a look and get back to you by Tuesday (because of the holiday).

Contributor Author

That's totally fine - thanks for expectation setting :) Enjoy the time away! We'll re-engage sometime next week.

Contributor Author

Keep in mind, this approach will always perform somewhere between the "random start index, iterate by 1" approach and the pseudo-code you wrote above in terms of following the pick ratios. In most cases, I think it will get close enough to the ideal that your pseudo-code achieves. However, there will be some cases where it looks closer to the "random start index, iterate by 1" approach.

The worst case that comes to mind is with 6 hosts that have active request counts of [9, 9, 9, 1, 1, 1]. 6 hosts only has two possible increment values (1 and 5 are the only coprimes with 6 that are <6). 6 is the highest count with only 2 possible increment values (that happen to be +1 and -1 increment). In that case, the pick ratios will be:

  • Indices 0-2: 0%
  • Index 3: 41.66% (5/12)
  • Index 4: 16.66% (2/12)
  • Index 5: 41.66% (5/12)

The ideal pick ratio is that indices 3-5 will all get 33.33%.

The pick ratios should approach the ideal as the number of possible increment values increases (coprimes with the host count). Here's what the possible increment values look like by host count:

  • 1: 0
  • 2: 1
  • 3: 1, 2
  • 4: 1, 3
  • 5: 1, 2, 3, 4
  • 6: 1, 5
  • 7: 1, 2, 3, 4, 5, 6
  • 8: 1, 3, 5, 7
  • 9: 1, 2, 4, 5, 7, 8
  • 10: 1, 3, 7, 9

If it's deemed critical that the selection match the ideal pick ratio in all cases, then this approach is not great. However, we could potentially have two different modes in that case... one FULL_SCAN mode that follows the ideal pick ratio in all cases, and one FULL_SCAN_BIASED mode with much lower memory overhead, with the trade-off that it doesn't always follow the ideal pick ratio (how much it deviates depends on the host count, how many ties for "least requests" exist, and how those ties are distributed in the host array).
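
For reference, the coprime-stride scan being described looks roughly like this (my sketch of the idea, not the code in this PR; pickCoprimeStride and activeRequests are hypothetical stand-ins for whatever the real implementation uses):

// Start at a random index and step by a stride coprime with the host count so
// that every host is visited exactly once. The strict '<' keeps the first host
// visited on a tie, which is the source of the bias discussed above.
const uint64_t n = hosts_to_use.size();
const uint64_t start = random_.random() % n;
const uint64_t stride = pickCoprimeStride(n, random_.random()); // hypothetical helper
HostSharedPtr best;
uint64_t best_rqs = std::numeric_limits<uint64_t>::max();
for (uint64_t i = 0; i < n; ++i) {
  const HostSharedPtr& candidate = hosts_to_use[(start + i * stride) % n];
  const uint64_t rqs = activeRequests(candidate); // hypothetical accessor
  if (rqs < best_rqs) {
    best = candidate;
    best_rqs = rqs;
  }
}
return best;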

@wbpcode
Member

wbpcode commented Jan 3, 2024

Before the implementation, I think the first thing here is that we shouldn't change the API directly even if it's marked as not-implemented. The API was introduced in #29873 and we have missed the window in which we could change it.

cc @envoyproxy/api-shepherds

@jkirschner-hashicorp
Contributor Author

@wbpcode : Making sure I understand: I need to leave enable_full_scan as the not-implemented field 5 of the protobuf, making selection_method a new field 6?

@wbpcode
Member

wbpcode commented Jan 3, 2024

Could you just implement this feature with the existing API? Is there any case that the bool cannot cover? 🤔

@jkirschner-hashicorp
Contributor Author

Could you just implement this feature with the existing API?

Yes. However, I was asked by @tonya11en to use an enum instead of a bool to define the selection method.

Let me know what field type you and @tonya11en agree upon, and then I'll proceed.

@jkirschner-hashicorp
Contributor Author

If we're stuck with the enable_full_scan field anyways (because we can't remove it), then perhaps it would be best to just use that field? And if a 3rd selection method is added in the future, the enable_full_scan field can be deprecated at that time (and an enum can be added)?

Just a thought. The enum approach is already implemented, so both are readily available approaches. I defer to you both, @wbpcode and @tonya11en.

@tonya11en
Member

@wbpcode if it's marked [#not-implemented-hide:] how would anything be affected by its removal? The docs have excluded any mention of the field, so there should be no valid uses in the wild.

@wbpcode
Member

wbpcode commented Jan 4, 2024

@tonya11en Envoy is not the only client that implements xDS. And the API is the API: even if we regret it now, we shouldn't change it directly after the 2-week window.

If we want more options in the future, deprecating it and adding a new field should be fine according to our API policy. But for now, I don't see the necessity.

@htuch
Member

htuch commented Jan 4, 2024

I had also taken [#not-implemented-hide:] to mean WiP, and just checked what our style / version guides say. It's mentioned in https://github.com/envoyproxy/envoy/blob/main/api/STYLE.md as being Envoy-specific, but let's check with @markdroth on this one. We should clarify in the API style guides the semantics of this annotation in a multi-dataplane world.

@wbpcode
Member

wbpcode commented Jan 4, 2024

A long time ago, I had a similar discussion with @markdroth about a field that was not implemented in Envoy but had been used in gRPC.

Considering gRPC is the only well-known non-Envoy xDS client, if @markdroth agrees, it's possible to treat this field as a special case.

But I think we still need a reasonable justification for this change.

The only reason I can think of now is that we could add a new combined mode in the future, by extending the enum, that enables full scan only when the host count is less than the choice count.
Not sure if this is an appropriate reason. 🤔

@tonya11en
Member

I'd be really surprised if someone already added support for that boolean in other data planes, like gRPC. In the meantime, can we just assume it's fine to remove but wait to merge until we're sure?

@wbpcode
Member

wbpcode commented Jan 5, 2024

I'd be really surprised if someone already added support for that boolean in other data planes, like gRPC.

Yeah. Basically we can be fairly sure this field hasn't been used, but it's not 100% certain, so the only way to keep the API stable is to not make that change. This is why we always need to be careful with API changes.

In the meantime, can we just assume it's fine to remove but wait to merge until we're sure?

Okay. We can review the API and code first. But my bottom line is to use a new field ID rather than changing the current field directly. And finally, we need an LGTM from @markdroth.

@jkirschner-hashicorp
Contributor Author

Reading between the lines of this conversation, it sounds like:

  • We want to use an enum rather than a Bool field
  • The enum field should be a new field, leaving the previous bool field as "not implemented"

Is this accurate? If so, I'll proceed with making the changes from review, though I'd still benefit from a response to this comment thread.

Member

@tonya11en left a comment

I elaborated on my earlier comment about the selection probabilities being wonky. If you want to see the test I hacked together, it's here: tonya11en@b12317f

Let me know if anything still needs clarification. Thanks for being patient throughout all this.


@wbpcode wbpcode assigned markdroth and unassigned wbpcode Jan 8, 2024
Member

@phlax left a comment

thanks for working on this @jkirschner-hashicorp, docs review below

// used to select the one with least requests instead of using random choices.
google.protobuf.BoolValue enable_full_scan = 5;
// Method for selecting the host set from which to return the host with the fewest active requests.
// Defaults to POWER_OF_N_CHOICES, where N is determined by `choice_count` (2 by default).
Member

Suggested change
// Defaults to POWER_OF_N_CHOICES, where N is determined by `choice_count` (2 by default).
// Defaults to ``POWER_OF_N_CHOICES``, where ``N`` is determined by ``choice_count`` (2 by default).

- area: upstream
change: |
Fixed a reported issue (https://github.com/envoyproxy/envoy/issues/11004) that causes the Least
Request load balancer policy to be unfair when the number of hosts are very small, when the number
Member

Please break this sentence up; currently it's not grammatically correct and the meaning is hard to decipher.

Contributor Author

This will be "fixed" by removing the entry entirely. Since we have decided to make full scan mode always require opt in (even if choice count > # hosts), this is no longer even an accurate/applicable entry.

change: |
Fixed a reported issue (https://github.com/envoyproxy/envoy/issues/11004) that causes the Least
Request load balancer policy to be unfair when the number of hosts are very small, when the number
of hosts is smaller than the choice_count, instead of randomly selection hosts from the list, we
Member

Suggested change
of hosts is smaller than the choice_count, instead of randomly selection hosts from the list, we
of hosts is smaller than the ``choice_count``, instead of randomly selection hosts from the list, we

Member

or better, use the link you have used below

- area: upstream
change: |
Added :ref:`selection_method <envoy_v3_api_msg_extensions.load_balancing_policies.least_request.v3.LeastRequest>`
option to the least request load balancer. If set to FULL_SCAN, or if
Member

Suggested change
option to the least request load balancer. If set to FULL_SCAN, or if
option to the Least Request load balancer. If set to ``FULL_SCAN``, or if

@phlax
Member

phlax commented Jan 25, 2024

for sure - as posted above - both are definitely unrelated - just an ongoing battle to reduce/tackle flakiness

@tonya11en
Member

I went ahead and ran the test 10k times and it seems we're all good:

 txallen@stormbeard  ~/src/envoy   jkirschner-hashicorp-least_request_lb_enable_full_scan_mode  bazel test --cache_test_results=no //test/common/upstream:load_balancer_impl_test --test_arg=--gtest_filter=LeastRequestLoadBalancerTest.FullScanMultipleHostsWithLeastRequests --runs_per_test=10000 --local_test_jobs=1
INFO: Invocation ID: 4477bf37-d686-4220-a0a4-32f870c78a9c
INFO: Analyzed target //test/common/upstream:load_balancer_impl_test (0 packages loaded, 532 targets configured).
INFO: Found 1 test target...
Target //test/common/upstream:load_balancer_impl_test up-to-date:
  bazel-bin/test/common/upstream/load_balancer_impl_test
INFO: Elapsed time: 44.994s, Critical Path: 3.31s
INFO: 10001 processes: 1 internal, 10000 remote.
INFO: Build completed successfully, 10001 total actions

Executed 1 out of 1 test: 1 test passes.

🎉

Signed-off-by: Jared Kirschner <jkirschner@hashicorp.com>
@jkirschner-hashicorp
Contributor Author

@tonya11en : Let me know if there are any more actions for me to take here. It looks like some CI steps failed, but don't seem to be related to changes introduced by this PR. Thank you for your input, feedback, and patience throughout!

@tonya11en
Member

/retest

@jkirschner-hashicorp there's nothing for you to do at this point. We just need a senior maintainer to review before the PR is eligible for a merge.

Will rerun the tests in the meantime and cross fingers.

@tonya11en
Member

/retest

@markdroth
Contributor

/lgtm api

@zuercher
Member

/retest

zuercher previously approved these changes Jan 30, 2024
Member

@zuercher left a comment

Not sure what's up with the tests, but otherwise looks good.

@jkirschner-hashicorp
Contributor Author

There's now a merge conflict with the changelogs file. I suppose I can fix that, and hopefully that kicks off a new build that passes the Envoy/Windows stage.

Signed-off-by: Jared Kirschner <jkirschner@hashicorp.com>
@jkirschner-hashicorp
Contributor Author

After addressing the changelog merge conflict (and making no other changes), the Envoy/Windows stage now passes, but some other stages are now failing for reasons that seem unrelated to this PR.

@tonya11en
Member

/retest

@jkirschner-hashicorp
Contributor Author

It looks like the only failure on this run was related to code coverage. But I'm not sure what this PR has to do with source/common/io. Any ideas?


@tonya11en
Member

@phlax do you know what is going on with the failing coverage test? It doesn't have anything to do with this PR.

@alyssawilk
Contributor

I think this needs a main merge to fix coverage (and current.yaml)
/wait

Signed-off-by: Jared Kirschner <jkirschner@hashicorp.com>
@jkirschner-hashicorp
Contributor Author

Fingers crossed that the main merge fixes the code coverage strangeness.

@jkirschner-hashicorp
Contributor Author

Seems like the coverage check passed this time.

However, one CI test flake occurred from the list of reported flakes (#32132):


@tonya11en
Member

/retest

@jkirschner-hashicorp
Contributor Author

Woo! All checks have passed!

@zuercher zuercher merged commit 1995d92 into envoyproxy:main Feb 9, 2024
55 checks passed