
lb-edf: fix lb initialization to choose from the correct set of weighted hosts #29953

Closed · wanted to merge 9 commits

Conversation

@adisuissa (Contributor):

Commit Message: lb-edf: fix lb initialization to choose from the correct set of weighted hosts
Additional Description:
Fixing the initialization of EDF-based LB algorithms. Prior to this PR, the initialization would randomize a number in [0, #hosts). However, it should randomize over the number of entries in the "pick list".
For example, if there are 2 hosts, host_A with weight 99 and host_B with weight 1, the randomization will land on either index 0 or 1, whereas it should land between 0 and 99 (as there are 100 entries in the "pick list").
This may cause a skew in the traffic distribution, especially across a fleet of Envoys, and when Envoy receives periodic assignment updates.
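
For illustration, a minimal sketch of the intended range (assuming integer weights; prePickRange is a hypothetical helper for this description, not the actual Envoy code):

    #include <cstdint>
    #include <numeric> // std::gcd (C++17)
    #include <vector>

    // One full cycle of the weighted "pick list" has sum(weights) / gcd(weights)
    // entries, so the initial random pick count should be drawn from
    // [0, this value), not [0, #hosts).
    uint64_t prePickRange(const std::vector<uint32_t>& weights) {
      uint32_t gcd = 0;
      uint64_t sum = 0;
      for (const uint32_t w : weights) {
        gcd = std::gcd(gcd, w); // std::gcd(0, w) == w
        sum += w;
      }
      return gcd == 0 ? 0 : sum / gcd; // e.g. weights {99, 1} -> 100, not 2
    }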

Risk Level: Medium - impacts the initialization of LB algorithms
Testing: Added unit tests
Docs Changes: N/A
Release Notes: Added a release note as well as the envoy.reloadable_features.edf_lb_scheduler_init_fix runtime guard.
Platform Specific Features: N/A

@repokitteh-read-only:

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).


Caused by: #29953 was opened by adisuissa.


Comment on lines +1113 to +1114
const uint32_t host0_weight = static_cast<uint32_t>(std::ceil(hostWeight(*hosts[0])));
const uint32_t host1_weight = static_cast<uint32_t>(std::ceil(hostWeight(*hosts[1])));
@wbpcode (Member), Oct 10, 2023:

My only concern is: what happens if hostWeight returns zero? Can that happen?

Otherwise LGTM :)

@adisuissa (Contributor, Author):

The host weight should be at least 1.
There's some normalization done on the host weight, but I don't think it can become 0.

@wbpcode (Member), Oct 10, 2023:

IIRC, some LBs apply a factor to the original weight (like LBs with slow start enabled), but the result should still be larger than 0. Let's add a check. 🤔

@adisuissa (Contributor, Author):

I'll add a lower bound to the normalizer.
@nezdolik, could slow start set host weights to 0?

Member:

Slow start code uses max(calculated_weight, min_weight_percentage) for the weight; min_weight_percentage defaults to 0.1. I'm not at a computer now, but I can paste the relevant links here later today.

@nezdolik (Member), Oct 10, 2023:

This piece of code applies the slow start factor in the most complex logical branch (in the other branches the original host weight is returned):

      return host_weight * std::max(applyAggressionFactor(time_factor, aggression),
                                    slow_start_min_weight_percent_);

1. If the original host_weight is 0, the computation result will be 0.
2. applyAggressionFactor(time_factor, aggression) cannot be zero; time_factor is calculated as:

       auto time_factor = static_cast<double>(std::max(std::chrono::milliseconds(1).count(),
                                                       in_healthy_state_duration.count())) /
                          slow_start_window_.count();

   To get the value closest to 0, the slow_start_window_ divisor would have to be the max value of std::chrono::milliseconds with a dividend of 1 millisecond; that value is still not 0 and fits into 4 bytes (double is 8 bytes, so there should be no overflow).
3. applyAggressionFactor is calculated as:

       if (aggression == 1.0 || time_factor == 1.0) {
         return time_factor;
       } else {
         return std::pow(time_factor, 1.0 / aggression);
       }

   It cannot evaluate to 0 because time_factor cannot be 0 (proven in step 2). The minimum possible time_factor raised to the minimum possible 1.0/aggression (which is 1/double_max_value) still did not yield overflow or zero.
4. slow_start_min_weight_percent_ can be set to 0 according to the API proto, but we take the max of the two values, in which case applyAggressionFactor(time_factor, aggression) is selected, which is proven nonzero in step 2.

Member:

I relied on various online tools for floating-point precision arithmetic and overflow testing, like https://www.eeweb.com/tools/online-scientific-calculator/ or https://www.h-schmidt.net/FloatConverter/IEEE754.html. With 99% confidence, the slow start factor function cannot return 0 :) (unless the original host weight is zero).

@adisuissa (Contributor, Author):

Thank you @nezdolik for the thorough verification.

// Compute the greatest common divisor of all the weights, to normalize
// the sum of weights to a smaller number. This can be avoided, but will
// result in higher numbers and more calls to `pickAndAdd()`.
const uint32_t host0_weight = static_cast<uint32_t>(std::ceil(hostWeight(*hosts[0])));
Member:

Wouldn't using ceil skew the proportion of host weights in some cases? E.g., if we have host1 with weight 0.1 and host2 with weight 1 (a 1:10 proportion), using ceil would turn that proportion into 1:1 and make host1 equally likely to be picked as host2.

Member:

Yeah, considering additional factors like slow start, hostWeight may return a very small value like 0.1.

Also, should we limit the max number of initial random picks (or add a switch for it)? If the weight difference between endpoints is very large, like 1 and 9999, the initial random pick count will be very large. cc @adisuissa

@adisuissa (Contributor, Author):

I took that into consideration when looking for how to fix the bug.
I'm not aware of a way to compute a GCD of non-integers. The alternative is to keep the bug as is (which also impacts the case of 2 hosts, one with weight=1 and the other with weight=0.1). If you know of an efficient way to compute the length of the cycle with non-integer weights, please let me know.

The right way to fix this is probably to convert this to static stride scheduling (as in the gRPC implementations). However, that would be a substantial change, and if added as an alternative, it would still not solve the current bug.

Re: limiting the max number of random picks, I can probably add that somewhere (bootstrap?), but that seems more like a permanent feature being added to work around a bug.

Member:

also tagging @tonya11en in case he has great ideas (he usually does :) )

Member:

cc @adisuissa is there any update on the problem that @nezdolik raised? Basically, I think this is the only point that may block this PR. (Ignore my suggestion about the max number of random picks; we can resolve that if it actually becomes a problem.)

// random value, 6 times the first host will be chosen, 3 times the second
// host will be chosen, and 1 time the third host will be chosen.
hostSet().healthy_hosts_ = {
    makeTestHost(info_, "tcp://127.0.0.1:80", simTime(), 6),
Member:

Could you also test with non-integer weights?

Member:

I see that the original weights are integer-only, but perhaps testing with a slow start config is worth covering as well.

@alyssawilk (Contributor):

Looks like this needs a main merge, but @wbpcode, is this waiting on you otherwise?

@wbpcode (Member) commented Oct 24, 2023:

/wait-any

@adisuissa (Contributor, Author):

Apologies for the late reply here.
I've modified the code to only work when there are no hosts in slow-start (and so host weights are integers).
@wbpcode @nezdolik PTAL.


for (uint32_t i = 0; i < seed_ % hosts.size(); ++i) {
  auto host =
      scheduler.edf_->pickAndAdd([this](const Host& host) { return hostWeight(host); });
if (Runtime::runtimeFeatureEnabled("envoy.reloadable_features.edf_lb_scheduler_init_fix") &&
Member:

One more aspect to consider, is that LeastRequest loadbalancer can also scale host weight based on number of active requests: https://github.com/envoyproxy/envoy/blob/main/source/common/upstream/load_balancer_impl.cc#L1276
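
(For reference, a hedged sketch of that scaling; the division by (active_requests + 1) ^ active_request_bias follows the documented least-request behavior, but the function name and signature here are illustrative.)

    #include <cmath>
    #include <cstdint>

    // Illustrative only: the least-request LB divides the configured weight by
    // (active_requests + 1) ^ active_request_bias, so the effective weight is
    // fractional and changes as requests start and finish.
    double effectiveWeight(double configured_weight, uint64_t active_requests,
                           double active_request_bias) {
      return configured_weight /
             std::pow(static_cast<double>(active_requests + 1), active_request_bias);
    }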

@adisuissa (Contributor, Author):

That's a good point. Let me address that.

@tonya11en (Member) left a comment:

Hey, sorry I missed this last month (was on vacation). This problem looks eerily similar to the "first pick problem" that was discussed in #14597 and summarized in #14681. Hopefully I'm understanding this problem correctly. In what I linked, the gist of the problem was that in cases where you have 2 hosts (weight 99 and weight 1) that are picked after initializing the EDF scheduler, it would always pick the host with weight 99. I did a lot of investigation into that and landed on an alternative scheduling discipline that was merged in #14681 called WRSQ.

There were some snags that it appears you ran into as well with scenarios where the effective host weights are changing after each selection (like in slow-start and least request). We did conclude that there's no good way to maintain correctness in the selection probabilities without a linear sweep of the host set to perform a weighted random selection.

I never got around to plugging in the WRSQ scheduler after implementing it and merging, so if it solves your problem you should consider using it here for the WRR cases. LMK if it makes sense or if I should write up something a bit more thorough.
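
(For context, a rough sketch of the WRSQ idea as described in the linked discussion, assuming hosts are grouped into one FIFO queue per distinct weight and a pick selects a queue with probability proportional to weight times queue size; illustrative only, not Envoy's actual implementation.)

    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <map>
    #include <random>

    // Illustrative WRSQ-style sketch: the very first pick already honors the
    // configured weights, since each host contributes `weight` units of mass.
    class WrsqSketch {
    public:
      void add(uint64_t weight, size_t host_index) { queues_[weight].push_back(host_index); }

      // Assumes at least one host has been added.
      size_t pick(std::mt19937_64& rng) {
        uint64_t total = 0;
        for (const auto& [w, q] : queues_) total += w * q.size();
        uint64_t r = std::uniform_int_distribution<uint64_t>(0, total - 1)(rng);
        for (auto& [w, q] : queues_) {
          const uint64_t mass = w * q.size();
          if (r < mass) {
            const size_t host = q.front(); // rotate round-robin within the queue
            q.pop_front();
            q.push_back(host);
            return host;
          }
          r -= mass;
        }
        return 0; // unreachable when total > 0
      }

    private:
      std::map<uint64_t, std::deque<size_t>> queues_; // one queue per distinct weight
    };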

@adisuissa (Contributor, Author):

> Hey, sorry I missed this last month (was on vacation). This problem looks eerily similar to the "first pick problem" that was discussed in #14597 and summarized in #14681. Hopefully I'm understanding this problem correctly. In what I linked, the gist of the problem was that in cases where you have 2 hosts (weight 99 and weight 1) that are picked after initializing the EDF scheduler, it would always pick the host with weight 99. I did a lot of investigation into that and landed on an alternative scheduling discipline that was merged in #14681 called WRSQ.
>
> There were some snags that it appears you ran into as well with scenarios where the effective host weights are changing after each selection (like in slow-start and least request). We did conclude that there's no good way to maintain correctness in the selection probabilities without a linear sweep of the host set to perform a weighted random selection.
>
> I never got around to plugging in the WRSQ scheduler after implementing it and merging, so if it solves your problem you should consider using it here for the WRR cases. LMK if it makes sense or if I should write up something a bit more thorough.

Thanks @tonya11en for this; I wasn't aware of WRSQ. I'm not sure I understand how it solves the problem: IIUC the call to refresh() will still call pickAndAdd() twice in your example, and prepick_queue_ population seems to only happen on peekAgain().

Internally we'd like to end up with an algorithm that may be a bit lighter (a static stride scheduler; see the gRPC implementation). However, as you pointed out, this will still not solve the underlying issue of non-integer weights.

I wonder if the right approach would be to separate out the cases where weights are non-integers, as these also carry some CPU cost.

@tonya11en (Member):

> I'm not sure I understand how it solves the problem: IIUC the call to refresh() will still call pickAndAdd() twice in your example, and prepick_queue_ population seems to only happen on peekAgain().

My understanding of the problem here is that the selection probability of the "first pick" after the EDF scheduler is initialized is not correct, given the weights of the hosts in the set. This was measured in an earlier simulation with 2 hosts, weight 99 and weight 1. The weight 99 host was picked first every single time:

[ RUN      ] WRRLoadBalancerTest.1stPickEDF
url:10.0.0.0:6379, weight:99, hits:10000, observed_hit_pct:100

Using WRSQ made it so that the first pick honored the selection probabilities warranted by the host weights:

[ RUN      ] WRRLoadBalancerTest.1stPickWRSQ
url:10.0.0.0:6379, weight:99, hits:98982, observed_hit_pct:98.982
url:10.0.0.0:6379, weight:1, hits:1018, observed_hit_pct:1.018

> Internally we'd like to end up with an algorithm that may be a bit lighter (a static stride scheduler; see the gRPC implementation). However, as you pointed out, this will still not solve the underlying issue of non-integer weights.

What does "lighter" mean in this case? The tradeoff with using the WRSQ scheduler is that it has to do a linear sweep of all the hosts every time one of their weights is updated, which is what makes it unsuitable for the LRLB. Outside of that case, its performance characteristics are at least as fast as EDF.

> I wonder if the right approach would be to separate out the cases where weights are non-integers, as these also carry some CPU cost.

This would be in all the cases where the weights are dynamically updated, like in least request LB or slow-start, right? There's no other way to handle this afaik.

@tonya11en (Member):

I should clarify that I'm not trying to push WRSQ on you here. Just thought it might be useful context for you.

@adisuissa (Contributor, Author):

> My understanding of the problem here is that the selection probability of the "first pick" after the EDF scheduler is initialized is not correct, given the weights of the hosts in the set. This was measured in an earlier simulation with 2 hosts, weight 99 and weight 1. The weight 99 host was picked first every single time:
>
> [ RUN      ] WRRLoadBalancerTest.1stPickEDF
> url:10.0.0.0:6379, weight:99, hits:10000, observed_hit_pct:100
>
> Using WRSQ made it so that the first pick honored the selection probabilities warranted by the host weights:
>
> [ RUN      ] WRRLoadBalancerTest.1stPickWRSQ
> url:10.0.0.0:6379, weight:99, hits:98982, observed_hit_pct:98.982
> url:10.0.0.0:6379, weight:1, hits:1018, observed_hit_pct:1.018

Nice, sorry if I missed these tests in the replies to the original issues.
I cannot find the tests (WRRLoadBalancerTest) in the codebase. Can you point me to the correct file?

>> Internally we'd like to end up with an algorithm that may be a bit lighter (a static stride scheduler; see the gRPC implementation). However, as you pointed out, this will still not solve the underlying issue of non-integer weights.
>
> What does "lighter" mean in this case? The tradeoff with using the WRSQ scheduler is that it has to do a linear sweep of all the hosts every time one of their weights is updated, which is what makes it unsuitable for the LRLB. Outside of that case, its performance characteristics are at least as fast as EDF.

"Lighter" - uses a single integer to decide which host to pick next. The integer is incremented every time a host is chosen.
The number of increments per pick is bounded by the number of hosts (the underlying assumption is that O(N) integer arithmetic will be at least as fast as using data structures such as a heap/queue).
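
For context, a minimal sketch of such a static stride scheduler (modeled loosely on the gRPC design; class and constant names are illustrative, and weights are assumed pre-scaled to integers in [1, kMaxWeight] with the largest equal to kMaxWeight, which bounds each pick to at most N sequence increments):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class StaticStrideScheduler {
    public:
      // Randomizing initial_sequence sidesteps the "first pick" problem.
      StaticStrideScheduler(std::vector<uint64_t> weights, uint64_t initial_sequence)
          : weights_(std::move(weights)), sequence_(initial_sequence) {}

      size_t pick() {
        while (true) {
          const uint64_t seq = sequence_++;
          const size_t index = seq % weights_.size();
          const uint64_t generation = seq / weights_.size();
          const uint64_t w = weights_[index];
          // Host `index` is emitted on exactly w of every kMaxWeight
          // generations, so its pick frequency is proportional to its weight.
          if ((w * generation + kOffset) % kMaxWeight < w) {
            return index;
          }
        }
      }

    private:
      static constexpr uint64_t kMaxWeight = 65535;
      static constexpr uint64_t kOffset = kMaxWeight / 2; // staggers low-weight picks
      std::vector<uint64_t> weights_;
      uint64_t sequence_;
    };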

>> I wonder if the right approach would be to separate out the cases where weights are non-integers, as these also carry some CPU cost.
>
> This would be in all the cases where the weights are dynamically updated, like in least request LB or slow-start, right? There's no other way to handle this afaik.

Yeah, I'm not sure what the best approach is here.

@KBaichoo (Contributor):

Friendly ping @tonya11en; I think this is waiting on your response.

@tonya11en (Member):

> I cannot find the tests (WRRLoadBalancerTest) in the codebase. Can you point me to the correct file?

This was just an existing test that I hacked together a long time ago. The patch is in tonya11en#1 if you wanted to take a look at what I was doing there.

Regarding the weighted random selections with non-integral weights, I'm not sure what folks here want to do about that problem (block PR, etc.). There's not a sub-linear selection algorithm (that I'm aware of) which is resilient to constantly changing and fractional weights like we see in the least request load balancer AND doesn't fall victim to this "first-pick problem".

I wasn't aware that I'm blocking progress here, so let me know what you need on my end.

@yanavlasov (Contributor):

/wait

@adisuissa (Contributor, Author):

Assigning Harvey as a senior maintainer who may have a better way to solve this issue.
/assign @htuch
TL;DR: this PR attempts to solve a bug where the initial number of iterations used to emulate picks is incorrect when hosts have different weights.
Unfortunately it only works for integer-precision weights, not for double-precision weights (Slow-Start, Least-Request).
The alternatives are:

1. Not solving the bug (known issue; close the PR).
2. Fixing the issue for the non-Slow-Start, non-Least-Request cases (this PR).
3. Adding a new load balancer that is simpler but doesn't support double-precision weights (this could be planned concurrently with this work).
4. Someone comes up with a better solution that works for double-precision weights :-)

@htuch (Member) commented Dec 6, 2023:

It feels like WRSQ is where we should throw effort going forward, and if there is a lightweight way to "fix" things here for integer-only weights, that seems fine as well.

@tonya11en (Member):

> It feels like WRSQ is where we should throw effort going forward, and if there is a lightweight way to "fix" things here for integer-only weights, that seems fine as well.

Just FYI, the WRSQ scheduler implementation was already merged a while ago. It was just never plumbed into the relevant LBs (the review process for #18020 was never finished).

@adisuissa (Contributor, Author):

/wait

htuch pushed a commit that referenced this pull request Jan 25, 2024
…31592)

This is a follow-up on #29953, with some insights about an additional scenario that has a similar problem.
The EDF scheduler is used in 2 levels of Zone-Aware LB policies:

1. For choosing the locality to use (here for healthy hosts). Note that as opposed to the problem in (2), this also impacts equal-weighted localities.
2. For choosing the host within the chosen locality (here).

Prior suggestions (WRSQ, #29953) focused on fixing level (2), and required converting the weights to integer precision before running the LB policy.
While this may work for (2), assuming one can convert the weights to integers by multiplying by some factor and truncating, for (1) this approach is more challenging: the "effective locality weight" depends on the over-provisioning factor and the ratio of available hosts to all hosts. An additional benefit is that the current approach also works with slow-start.

Thus, this PR suggests a different mechanism to "perform" some random number of picks when creating the EDF scheduler. The creation process is split into two steps (sketched after the note below):

1. Estimate a lower bound on the number of times each entry will be picked, and initialize the internal EDF priority queue accordingly.
2. Perform up to N "pickAndAdd" operations, where N is the number of localities/hosts.

Note that this approach is ~equal to performing some P picks from the scheduler where P >> N (~equal up to double-precision computation differences and the ordering of entries with the same weight).
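
A minimal sketch of that two-step fast-forward, assuming the usual EDF deadline formula deadline = (picks + 1) / weight; Entry, fastForwardInit, and the flat std::vector scan are illustrative, not the actual Envoy code:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Entry {
      double weight;   // may be fractional (e.g. slow start)
      double deadline; // EDF deadline after k picks: (k + 1) / weight
    };

    // Emulate `total_picks` picks without performing them one by one.
    void fastForwardInit(std::vector<Entry>& entries, uint64_t total_picks) {
      double weight_sum = 0;
      for (const Entry& e : entries) weight_sum += e.weight;

      // Step 1: lower-bound the picks each entry would have received and
      // advance its deadline accordingly.
      uint64_t emulated = 0;
      for (Entry& e : entries) {
        const auto k = static_cast<uint64_t>(total_picks * (e.weight / weight_sum));
        e.deadline = static_cast<double>(k + 1) / e.weight;
        emulated += k;
      }

      // Step 2: the residue is fewer than entries.size() picks; emulate each
      // as a regular EDF pickAndAdd (take the minimum deadline, advance by 1/w).
      for (; emulated < total_picks; ++emulated) {
        auto it = std::min_element(
            entries.begin(), entries.end(),
            [](const Entry& a, const Entry& b) { return a.deadline < b.deadline; });
        it->deadline += 1.0 / it->weight;
      }
    }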

After this PR, the next thing is to plumb the PRNG (or a random value) into the locality-scheduler creation process, and fix the call site for (1).

Here's a short doc providing the rationale behind this work: https://docs.google.com/document/d/1i53RYOy8sJUwL6_PSIF4yN39Xc1LvJtZ-18aSBZq0vM/edit?usp=sharing

Risk Level: low - code not used by Envoy's main codebase
Testing: Added tests to validate equivalence with prior approach.
Docs Changes: N/A
Release Notes: N/A - will be updated when the
Platform Specific Features: N/A

@adisuissa (Contributor, Author):

Closing this in favor of #31592

@adisuissa adisuissa closed this Jan 26, 2024