
lb-edf: fix lb initialization to choose from the correct set of weighted hosts #29953

Closed · wanted to merge 9 commits

Conversation

@adisuissa (Contributor):

Commit Message: lb-edf: fix lb initialization to choose from the correct set of weighted hosts
Additional Description:
Fixing the initialization of EDF-based LB algorithms. Prior to this PR, the initialization would randomize a number in [0, #hosts). However, it should randomize over the number of entries in the "pick list".
For example, if there are 2 hosts, host_A with weight 99 and host_B with weight 1, the randomization will land on either index 0 or 1, whereas it should land between 0 and 99 (as there are 100 entries in the "pick list").
This may cause a skew in the traffic distribution, especially across a fleet of Envoys, and when Envoy receives periodic assignment updates.
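
For illustration, a minimal sketch of the intended range (assuming integer weights; prePickRange is a hypothetical helper for this description, not the actual Envoy code):

    #include <cstdint>
    #include <numeric> // std::gcd (C++17)
    #include <vector>

    // One full cycle of the weighted "pick list" has sum(weights) / gcd(weights)
    // entries, so the initial random pick count should be drawn from
    // [0, this value), not [0, #hosts).
    uint64_t prePickRange(const std::vector<uint32_t>& weights) {
      uint32_t gcd = 0;
      uint64_t sum = 0;
      for (const uint32_t w : weights) {
        gcd = std::gcd(gcd, w); // std::gcd(0, w) == w
        sum += w;
      }
      return gcd == 0 ? 0 : sum / gcd; // e.g. weights {99, 1} -> 100, not 2
    }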

Risk Level: Medium - impacts the initialization of LB algorithms
Testing: Added unit tests
Docs Changes: N/A
Release Notes: Added a release note as well as the envoy.reloadable_features.edf_lb_scheduler_init_fix runtime guard.
Platform Specific Features: N/A

@repokitteh-read-only:

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).


Caused by: #29953 was opened by adisuissa.


Comment on lines +1113 to +1114
const uint32_t host0_weight = static_cast<uint32_t>(std::ceil(hostWeight(*hosts[0])));
const uint32_t host1_weight = static_cast<uint32_t>(std::ceil(hostWeight(*hosts[1])));
@wbpcode (Member), Oct 10, 2023:

My only concern is: what happens if hostWeight returns zero? Can that happen?

Otherwise LGTM :)

@adisuissa (Contributor, Author):

The host weight should be at least 1.
There's some normalization done on the host weight, but I don't think it can become 0.

@wbpcode (Member), Oct 10, 2023:

IIRC, some LBs apply a factor to the original weight (like LBs with slow start enabled), but the result should still be larger than 0. Let's add a check. 🤔

@adisuissa (Contributor, Author):

I'll add a lower bound to the normalizer.
@nezdolik, could slow start set host weights to 0?

Member:

Slow start code uses max(calculated_weight, min_weight_percentage) for the weight; min_weight_percentage defaults to 0.1. I'm not at a computer now, but I can paste the relevant links here later today.

@nezdolik (Member), Oct 10, 2023:

This piece of code applies the slow start factor in the most complex logical branch (in the other branches the original host weight is returned):

      return host_weight * std::max(applyAggressionFactor(time_factor, aggression),
                                    slow_start_min_weight_percent_);

1. If the original host_weight is 0, the computation result will be 0.
2. applyAggressionFactor(time_factor, aggression) cannot be zero; time_factor is calculated as:

       auto time_factor = static_cast<double>(std::max(std::chrono::milliseconds(1).count(),
                                                       in_healthy_state_duration.count())) /
                          slow_start_window_.count();

   To get the value closest to 0, the slow_start_window_ divisor would have to be the max value of std::chrono::milliseconds with a dividend of 1 millisecond; that value is still not 0 and fits into 4 bytes (double is 8 bytes, so there should be no overflow).
3. applyAggressionFactor is calculated as:

       if (aggression == 1.0 || time_factor == 1.0) {
         return time_factor;
       } else {
         return std::pow(time_factor, 1.0 / aggression);
       }

   It cannot evaluate to 0 because time_factor cannot be 0 (proven in step 2). The minimum possible time_factor raised to the minimum possible 1.0/aggression (which is 1/double_max_value) still did not yield overflow or zero.
4. slow_start_min_weight_percent_ can be set to 0 according to the API proto, but we take the max of the two values, in which case applyAggressionFactor(time_factor, aggression) is selected, which is proven nonzero in step 2.

Member:

I relied on various online tools for floating-point precision arithmetic and overflow testing, like https://www.eeweb.com/tools/online-scientific-calculator/ or https://www.h-schmidt.net/FloatConverter/IEEE754.html. With 99% confidence, the slow start factor function cannot return 0 :) (unless the original host weight is zero).

@adisuissa (Contributor, Author):

Thank you @nezdolik for the thorough verification.

// Compute the greatest common divisor of all the weights, to normalize
// the sum of weights to a smaller number. This can be avoided, but will
// result in higher numbers and more calls to `pickAndAdd()`.
const uint32_t host0_weight = static_cast<uint32_t>(std::ceil(hostWeight(*hosts[0])));
Member:

Wouldn't using ceil skew the proportion of host weights in some cases? E.g., if we have host1 with weight 0.1 and host2 with weight 1 (a 1:10 proportion), using ceil would turn that proportion into 1:1 and make host1 equally likely to be picked as host2.

Member:

Yeah, considering additional factors like slow start, hostWeight may return a very small value like 0.1.

Also, should we limit the max number of initial random picks (or add a switch for it)? If the weight difference between endpoints is very large, like 1 and 9999, the initial random pick count will be very large. cc @adisuissa

@adisuissa (Contributor, Author):

I took that into consideration when looking for how to fix the bug.
I'm not aware of a way to compute a GCD of non-integers. The alternative is to keep the bug as is (which also impacts the case of 2 hosts, one with weight=1 and the other with weight=0.1). If you know of an efficient way to compute the length of the cycle with non-integer weights, please let me know.

The right way to fix this is probably to convert this to static stride scheduling (as in the gRPC implementations). However, that would be a substantial change, and if added as an alternative, it would still not solve the current bug.

Re: limiting the max number of random picks, I can probably add that somewhere (bootstrap?), but that seems more like a permanent feature being added to work around a bug.

Member:

also tagging @tonya11en in case he has great ideas (he usually does :) )

Member:

cc @adisuissa is there any update on the problem that @nezdolik raised? Basically, I think this is the only point that may block this PR. (Ignore my suggestion about the max number of random picks; we can resolve that if it actually becomes a problem.)

// random value, 6 times the first host will be chosen, 3 times the second
// host will be chosen, and 1 time the third host will be chosen.
hostSet().healthy_hosts_ = {
    makeTestHost(info_, "tcp://127.0.0.1:80", simTime(), 6),
Member:

Could you also test with non-integer weights?

Member:

I see that the original weights are integer-only, but perhaps testing with a slow start config is worth covering as well.

@alyssawilk (Contributor):

Looks like this needs a main merge, but @wbpcode, is this waiting on you otherwise?

@wbpcode (Member) commented Oct 24, 2023:

/wait-any

@adisuissa (Contributor, Author):

Apologies for the late reply here.
I've modified the code to only work when there are no hosts in slow-start (and so host weights are integers).
@wbpcode @nezdolik PTAL.


for (uint32_t i = 0; i < seed_ % hosts.size(); ++i) {
  auto host =
      scheduler.edf_->pickAndAdd([this](const Host& host) { return hostWeight(host); });
if (Runtime::runtimeFeatureEnabled("envoy.reloadable_features.edf_lb_scheduler_init_fix") &&
Member:

One more aspect to consider, is that LeastRequest loadbalancer can also scale host weight based on number of active requests: https://github.com/envoyproxy/envoy/blob/main/source/common/upstream/load_balancer_impl.cc#L1276
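
(For reference, a hedged sketch of that scaling; the division by (active_requests + 1) ^ active_request_bias follows the documented least-request behavior, but the function name and signature here are illustrative.)

    #include <cmath>
    #include <cstdint>

    // Illustrative only: the least-request LB divides the configured weight by
    // (active_requests + 1) ^ active_request_bias, so the effective weight is
    // fractional and changes as requests start and finish.
    double effectiveWeight(double configured_weight, uint64_t active_requests,
                           double active_request_bias) {
      return configured_weight /
             std::pow(static_cast<double>(active_requests + 1), active_request_bias);
    }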

@adisuissa (Contributor, Author):

That's a good point. Let me address that.

@tonya11en (Member) left a comment:

Hey, sorry I missed this last month (was on vacation). This problem looks eerily similar to the "first pick problem" that was discussed in #14597 and summarized in #14681. Hopefully I'm understanding this problem correctly. In what I linked, the gist of the problem was that in cases where you have 2 hosts (weight 99 and weight 1) that are picked after initializing the EDF scheduler, it would always pick the host with weight 99. I did a lot of investigation into that and landed on an alternative scheduling discipline that was merged in #14681 called WRSQ.

There were some snags that it appears you ran into as well with scenarios where the effective host weights are changing after each selection (like in slow-start and least request). We did conclude that there's no good way to maintain correctness in the selection probabilities without a linear sweep of the host set to perform a weighted random selection.

I never got around to plugging in the WRSQ scheduler after implementing it and merging, so if it solves your problem you should consider using it here for the WRR cases. LMK if it makes sense or if I should write up something a bit more thorough.
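
(For context, a rough sketch of the WRSQ idea as described in the linked discussion, assuming hosts are grouped into one FIFO queue per distinct weight and a pick selects a queue with probability proportional to weight times queue size; illustrative only, not Envoy's actual implementation.)

    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <map>
    #include <random>

    // Illustrative WRSQ-style sketch: the very first pick already honors the
    // configured weights, since each host contributes `weight` units of mass.
    class WrsqSketch {
    public:
      void add(uint64_t weight, size_t host_index) { queues_[weight].push_back(host_index); }

      // Assumes at least one host has been added.
      size_t pick(std::mt19937_64& rng) {
        uint64_t total = 0;
        for (const auto& [w, q] : queues_) total += w * q.size();
        uint64_t r = std::uniform_int_distribution<uint64_t>(0, total - 1)(rng);
        for (auto& [w, q] : queues_) {
          const uint64_t mass = w * q.size();
          if (r < mass) {
            const size_t host = q.front(); // rotate round-robin within the queue
            q.pop_front();
            q.push_back(host);
            return host;
          }
          r -= mass;
        }
        return 0; // unreachable when total > 0
      }

    private:
      std::map<uint64_t, std::deque<size_t>> queues_; // one queue per distinct weight
    };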

@adisuissa (Contributor, Author):

> Hey, sorry I missed this last month (was on vacation). This problem looks eerily similar to the "first pick problem" that was discussed in #14597 and summarized in #14681. Hopefully I'm understanding this problem correctly. In what I linked, the gist of the problem was that in cases where you have 2 hosts (weight 99 and weight 1) that are picked after initializing the EDF scheduler, it would always pick the host with weight 99. I did a lot of investigation into that and landed on an alternative scheduling discipline that was merged in #14681 called WRSQ.
>
> There were some snags that it appears you ran into as well with scenarios where the effective host weights are changing after each selection (like in slow-start and least request). We did conclude that there's no good way to maintain correctness in the selection probabilities without a linear sweep of the host set to perform a weighted random selection.
>
> I never got around to plugging in the WRSQ scheduler after implementing it and merging, so if it solves your problem you should consider using it here for the WRR cases. LMK if it makes sense or if I should write up something a bit more thorough.

Thanks @tonya11en for this; I wasn't aware of WRSQ. I'm not sure I understand how it solves the problem: IIUC the call to refresh() will still call pickAndAdd() twice in your example, and prepick_queue_ population seems to only happen on peekAgain().

Internally we'd like to end up with an algorithm that may be a bit lighter (a static stride scheduler; see the gRPC implementation). However, as you pointed out, this will still not solve the underlying issue of non-integer weights.

I wonder if the right approach would be to separate out the cases where weights are non-integers, as these also carry some CPU cost.

@tonya11en (Member):

> I'm not sure I understand how it solves the problem: IIUC the call to refresh() will still call pickAndAdd() twice in your example, and prepick_queue_ population seems to only happen on peekAgain().

My understanding of the problem here is that the selection probability of the "first pick" after the EDF scheduler is initialized is not correct, given the weights of the hosts in the set. This was measured in an earlier simulation with 2 hosts, weight 99 and weight 1. The weight 99 host was picked first every single time:

[ RUN      ] WRRLoadBalancerTest.1stPickEDF
url:10.0.0.0:6379, weight:99, hits:10000, observed_hit_pct:100

Using WRSQ made it so that the first pick honored the selection probabilities warranted by the host weights:

[ RUN      ] WRRLoadBalancerTest.1stPickWRSQ
url:10.0.0.0:6379, weight:99, hits:98982, observed_hit_pct:98.982
url:10.0.0.0:6379, weight:1, hits:1018, observed_hit_pct:1.018

> Internally we'd like to end up with an algorithm that may be a bit lighter (a static stride scheduler; see the gRPC implementation). However, as you pointed out, this will still not solve the underlying issue of non-integer weights.

What does "lighter" mean in this case? The tradeoff with using the WRSQ scheduler is that it has to do a linear sweep of all the hosts every time one of their weights is updated, which is what makes it unsuitable for the LRLB. Outside of that case, its performance characteristics are at least as fast as EDF.

> I wonder if the right approach would be to separate out the cases where weights are non-integers, as these also carry some CPU cost.

This would be in all the cases where the weights are dynamically updated, like in least request LB or slow-start, right? There's no other way to handle this afaik.

@tonya11en (Member):

I should clarify that I'm not trying to push WRSQ on you here. Just thought it might be useful context for you.

@adisuissa (Contributor, Author):

> My understanding of the problem here is that the selection probability of the "first pick" after the EDF scheduler is initialized is not correct, given the weights of the hosts in the set. This was measured in an earlier simulation with 2 hosts, weight 99 and weight 1. The weight 99 host was picked first every single time:
>
> [ RUN      ] WRRLoadBalancerTest.1stPickEDF
> url:10.0.0.0:6379, weight:99, hits:10000, observed_hit_pct:100
>
> Using WRSQ made it so that the first pick honored the selection probabilities warranted by the host weights:
>
> [ RUN      ] WRRLoadBalancerTest.1stPickWRSQ
> url:10.0.0.0:6379, weight:99, hits:98982, observed_hit_pct:98.982
> url:10.0.0.0:6379, weight:1, hits:1018, observed_hit_pct:1.018

Nice, sorry if I missed these tests in the replies to the original issues.
I cannot find the tests (WRRLoadBalancerTest) in the codebase. Can you point me to the correct file?

>> Internally we'd like to end up with an algorithm that may be a bit lighter (a static stride scheduler; see the gRPC implementation). However, as you pointed out, this will still not solve the underlying issue of non-integer weights.
>
> What does "lighter" mean in this case? The tradeoff with using the WRSQ scheduler is that it has to do a linear sweep of all the hosts every time one of their weights is updated, which is what makes it unsuitable for the LRLB. Outside of that case, its performance characteristics are at least as fast as EDF.

"Lighter" - uses a single integer to decide which host to pick next. The integer is incremented every time a host is chosen.
The number of increments per pick is bounded by the number of hosts (the underlying assumption is that O(N) integer arithmetic will be at least as fast as using data structures such as a heap/queue).
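
For context, a minimal sketch of such a static stride scheduler (modeled loosely on the gRPC design; class and constant names are illustrative, and weights are assumed pre-scaled to integers in [1, kMaxWeight] with the largest equal to kMaxWeight, which bounds each pick to at most N sequence increments):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class StaticStrideScheduler {
    public:
      // Randomizing initial_sequence sidesteps the "first pick" problem.
      StaticStrideScheduler(std::vector<uint64_t> weights, uint64_t initial_sequence)
          : weights_(std::move(weights)), sequence_(initial_sequence) {}

      size_t pick() {
        while (true) {
          const uint64_t seq = sequence_++;
          const size_t index = seq % weights_.size();
          const uint64_t generation = seq / weights_.size();
          const uint64_t w = weights_[index];
          // Host `index` is emitted on exactly w of every kMaxWeight
          // generations, so its pick frequency is proportional to its weight.
          if ((w * generation + kOffset) % kMaxWeight < w) {
            return index;
          }
        }
      }

    private:
      static constexpr uint64_t kMaxWeight = 65535;
      static constexpr uint64_t kOffset = kMaxWeight / 2; // staggers low-weight picks
      std::vector<uint64_t> weights_;
      uint64_t sequence_;
    };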

>> I wonder if the right approach would be to separate out the cases where weights are non-integers, as these also carry some CPU cost.
>
> This would be in all the cases where the weights are dynamically updated, like in least request LB or slow-start, right? There's no other way to handle this afaik.

Yeah, I'm not sure what the best approach is here.

@KBaichoo (Contributor):

Friendly ping @tonya11en; I think this is waiting on your response.

@tonya11en (Member):

> I cannot find the tests (WRRLoadBalancerTest) in the codebase. Can you point me to the correct file?

This was just an existing test that I hacked together a long time ago. The patch is in tonya11en#1 if you wanted to take a look at what I was doing there.

Regarding the weighted random selections with non-integral weights, I'm not sure what folks here want to do about that problem (block PR, etc.). There's not a sub-linear selection algorithm (that I'm aware of) which is resilient to constantly changing and fractional weights like we see in the least request load balancer AND doesn't fall victim to this "first-pick problem".

I wasn't aware that I'm blocking progress here, so let me know what you need on my end.

@yanavlasov (Contributor):

/wait

@adisuissa (Contributor, Author):

Assigning Harvey as a senior maintainer who may have a better way to solve this issue.
/assign @htuch
TL;DR: this PR attempts to solve a bug where the initial number of iterations used to emulate picks is incorrect when hosts have different weights.
Unfortunately it only works for integer-precision weights, not for double-precision weights (Slow-Start, Least-Request).
The alternatives are:

1. Not solving the bug (known issue; close the PR).
2. Fixing the issue for the non-Slow-Start, non-Least-Request cases (this PR).
3. Adding a new load balancer that is simpler but doesn't support double-precision weights (this could be planned concurrently with this work).
4. Someone comes up with a better solution that works for double-precision weights :-)

@htuch (Member) commented Dec 6, 2023:

It feels like WRSQ is where we should throw effort going forward, and if there is a lightweight way to "fix" things here for integer-only weights, that seems fine as well.

@tonya11en (Member):

> It feels like WRSQ is where we should throw effort going forward, and if there is a lightweight way to "fix" things here for integer-only weights, that seems fine as well.

Just FYI, the WRSQ scheduler implementation was already merged a while ago. It was just never plumbed into the relevant LBs (the review process for #18020 was never finished).

@adisuissa (Contributor, Author):

/wait

htuch pushed a commit that referenced this pull request Jan 25, 2024
…31592)

This is a follow-up on #29953, with some insights about an additional scenario that has a similar problem.
The EDF scheduler is used in 2 levels of Zone-Aware LB policies:

1. For choosing the locality to use (here for healthy hosts). Note that as opposed to the problem in (2), this also impacts equal-weighted localities.
2. For choosing the host within the chosen locality (here).

Prior suggestions (WRSQ, #29953) focused on fixing level (2), and required converting the weights to integer precision before running the LB policy.
While this may work for (2), assuming one can convert the weights to integers by multiplying by some factor and truncating, for (1) this approach is more challenging: the "effective locality weight" depends on the over-provisioning factor and the ratio of available hosts to all hosts. An additional benefit is that the current approach also works with slow-start.

Thus, this PR suggests a different mechanism to "perform" some random number of picks when creating the EDF scheduler. The creation process is split into two steps (sketched after the note below):

1. Estimate a lower bound on the number of times each entry will be picked, and initialize the internal EDF priority queue accordingly.
2. Perform up to N "pickAndAdd" operations, where N is the number of localities/hosts.

Note that this approach is ~equal to performing some P picks from the scheduler where P >> N (~equal up to double-precision computation differences and the ordering of entries with the same weight).
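
A minimal sketch of that two-step fast-forward, assuming the usual EDF deadline formula deadline = (picks + 1) / weight; Entry, fastForwardInit, and the flat std::vector scan are illustrative, not the actual Envoy code:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Entry {
      double weight;   // may be fractional (e.g. slow start)
      double deadline; // EDF deadline after k picks: (k + 1) / weight
    };

    // Emulate `total_picks` picks without performing them one by one.
    void fastForwardInit(std::vector<Entry>& entries, uint64_t total_picks) {
      double weight_sum = 0;
      for (const Entry& e : entries) weight_sum += e.weight;

      // Step 1: lower-bound the picks each entry would have received and
      // advance its deadline accordingly.
      uint64_t emulated = 0;
      for (Entry& e : entries) {
        const auto k = static_cast<uint64_t>(total_picks * (e.weight / weight_sum));
        e.deadline = static_cast<double>(k + 1) / e.weight;
        emulated += k;
      }

      // Step 2: the residue is fewer than entries.size() picks; emulate each
      // as a regular EDF pickAndAdd (take the minimum deadline, advance by 1/w).
      for (; emulated < total_picks; ++emulated) {
        auto it = std::min_element(
            entries.begin(), entries.end(),
            [](const Entry& a, const Entry& b) { return a.deadline < b.deadline; });
        it->deadline += 1.0 / it->weight;
      }
    }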

After this PR, the next thing is to plumb the PRNG (or a random value) into the locality-scheduler creation process, and fix the call site for (1).

Here's a short doc providing the rationale behind this work: https://docs.google.com/document/d/1i53RYOy8sJUwL6_PSIF4yN39Xc1LvJtZ-18aSBZq0vM/edit?usp=sharing

Risk Level: low - code not used by Envoy's main codebase
Testing: Added tests to validate equivalence with prior approach.
Docs Changes: N/A
Release Notes: N/A - will be updated when the
Platform Specific Features: N/A

@adisuissa (Contributor, Author):

Closing this in favor of #31592

@adisuissa adisuissa closed this Jan 26, 2024