From 9ded05824644ca77a04edcbddca2fd3ab13afd3f Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Thu, 25 Apr 2024 11:30:54 -0600 Subject: [PATCH 01/13] first draft --- A80-pid.md | 271 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 271 insertions(+) create mode 100644 A80-pid.md diff --git a/A80-pid.md b/A80-pid.md new file mode 100644 index 000000000..90d5bd2ee --- /dev/null +++ b/A80-pid.md @@ -0,0 +1,271 @@ +A68: PID LB policy. +---- +* Author(s): @s-matyukevich +* Approver: +* Status: Draft +* Implemented in: PoC in Go +* Last updated: 2024-04-18 +* Discussion at: + +## Abstract + +This document proposes a design for a new load balancing policy `pid`. `pid` stands for [Proportional–integral–derivative controller](https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller). This policy is built on top of [A58: weighted_round_robin LB policy (WRR)][A58] and also requires direct load reporting from backends to clients. Like `wrr` it also uses client-side weighted round robin load balancing. Unlike `wrr` it doesn't build weights deterministically, instead it uses feedback loop with `pid` controller to adjust the weights in such a way that load on all backends converge to the same value. The policy supports either per-call or periodic out-of-band load reporting per [gRFC A51][A51]. + +## Background + +`wrr` uses the following formula to calculate the subchannel weights: + +$$weight = \dfrac{qps}{utilization + \dfrac{eps}{qps} * error\\_utilization\\_penalty}$$ + +This works great in cases when backends have different average CPU cost per request and recieve identical number of connections. In such cases `wrr` helps to distribute requests between backends fairly, such that more powerfull backends recieve more requests and less powerfull backends recieve less requests. However, `wrr` doesn't help at all to correct the imbalance generated by usage of random subsetting, as described in [gRFC A68][A68] This is because the usage of random subsetting results in a state where some backends recieve more connections than others. The number of connections a server recieves doesn't affect its CPU cost per request metric, so more connected backends will be reciving more requests than less connected backends. + +`pid` balancer takes a different aproach: instead of deterministically calculating weights based on some backend metric, it keeps adjusting weights at runtime and uses a feedback loop based on backend CPU metric to determine the direction and maginute for every weight update. + +### Related Proposals: +* [gRFC A51][A51] +* [gRFC A52][A52] +* [gRFC A58][A58] +* [gRFC A68][A68] + +## Proposal + +Introduce a new LB policy `pid`. This policy implements client-side load balancing with direct load reports from backends. It uses feedback look with PID controller to tune the weights. The policy is otherwise largely a superset of the existing policy `weighted_round_robin`. + +### LB Policy Config and Parameters + +The `pid` LB policy config will be as follows. + +```textproto +message LoadBalancingConfig { + oneof policy { + PIDLbConfig pid = 20 [json_name = "pid"]; + } +} + +message PIDLbConfig { + // The config for the WRR load balancer, as defined in [gRFC A58][A58] + // PID balancer is an extenstion of WRR and all WRR settings apply to PID in an identical way. + WeightedRoundRobinLbConfig wrr_config = 1; + + // Only after eps/qps grows past this value the balancer starts taking into account ErrorUtilizationPenalty. 
+ // This is necessary to avoid oscilations in cases when server has very high and spiky error rate. + // Even in such cases we don't want to remove error_utilization_penalty completely as then we could + // redirect all traffic to an instance that has low CPU and simply rejects all requests. + // Default is 0.5. + google.protobuf.FloatValue error_utilization_threshold = 2; + + // Controls how fast PID controller converges to mean value. Higher values speed up convergence + // but can result in oscillations. Oscillations can happen if server load changes faster than + // PID controller can react to the changes. This could happen if there are significant delays in + // load report propagation (for example, due to use of OOB reporting with large load period) or if + // server load is very spiky. To deal with spiky server load server owners should use moving average + // on the server to smooth load function. + // Default is 0.1. + google.protobuf.FloatValue proportional_gain = 2; + + // Controls how smooth PID controller convergence is. Higher values makes it more smooth, + // but can slow down convergence. + // Default is 1. + google.protobuf.FloatValue derivative_gain = 4; + + // Max allowed weight. If PID controller attempts to set a higher weight it will be capped at this value. + // This is necessary to prevent weights growing to infinity, which could happen if only a subset of clients + // is using PID and we reached a point after which increasing weights no longer help to correct the imbalance. + // The default is 10. + google.protobuf.FloatValue max_weight = 5; + + // Min allowed weight. If PID controller attempts to set a lower weight it will be capped at this value. + // This is necessary to prevent weights dropping to zero, which could happen if only a subset of clients + // is using PID and we reached a point after which decreasing weights no longer help to correct the imbalance. + // The default is 0.1 + google.protobuf.FloatValue min_weight = 6; +} +``` + +### PID controller + +A PID controller is a control loop mechanism employing feedback. It continuously calculates an error value as the difference between a desired setpoint (`referenceSignal`) and a measured process variable (`actualSignal`) and applies a correction based on proportional, integral, and derivative terms (denoted P, I, and D respectively), hence the name. + +In our implementation we won't be using integral part (It is usefull to speed up convergence when `referenceSignal` changes shraply. In our case we will be conveging the load on the subchannels to the mean value, which is mostly stable) + +Here is a sample implementation in pseudo-code. + +``` +pidController class { + proportionalGain float + derivativeGain float + + controlError float + + update(referenceSignal float, actualSignal float, samplingInterval duration) float { + previousError = this.controlError + // save last controlError so we can use it to calculate derivative during next update + this.controlError = referenceSignal - actualSignal + controlErrorDerivative = (controlError - previousError) / samplingInterval.Seconds() + controlSignal = this.proportionalGain*this.ontrolError + + this.derivativeGain*this.controlErrorDerivative + } +} +``` + +`update` method is expected to be called on a regular basis. `samplingInterval` is the duration since the last update. Return value is the control signal which, if applied to the system, should minimize control error. In the next section we'll discuss how control signal is converted to `wrr` weight. 
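+
+A quick worked illustration (hypothetical numbers, using the pseudo-code above): with `proportionalGain = 0.1`, `derivativeGain = 1`, a previous `controlError` of 0, a mean utilization (`referenceSignal`) of 0.5, a reported utilization (`actualSignal`) of 0.7 and a 1-second sampling interval:
+
+```
+controlError           = 0.5 - 0.7             = -0.2
+controlErrorDerivative = (-0.2 - 0) / 1        = -0.2
+controlSignal          = 0.1*(-0.2) + 1*(-0.2) = -0.22
+```
+
+The negative signal indicates that this subchannel is running above the mean, so its weight will be decreased.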
+ +`proportionalGain` and `derivativeGain` parameters are taken from the lb config. `proportionalGain` should be first scaled by the `WeightUpdatePeriod` value. This is because `controlErrorDerivative` is inversly proportional to the sampling interval, which in turn is close to `WeightUpdatePeriod` as we will be updating PID state once per `WeightUpdatePeriod`. If `WeightUpdatePeriod` is too small `controlErrorDerivative` becomes too large and dominates the resulting controll error. + +### Extending WRR balancer + +`pid` balancer reuses 90% of `wrr` code. The proposal is to refactor `wrr` and add a few hooks to it so that other balancers (like `pid`) could reuse the code without copy-paste. We should keep those hooks internal (at least at first) to avoid locking ourselves into a new public API. This is mostly language specific, but the idea is to do the following: +* Add `callbacks` object to wrr balancer. This object contains a few callbacks that `wrr` will be calling during various stages of its lifetime. +* Add `callbackData` object, which will be used by callbacks to store any data that is reused between callbacks. The balancer will be passing it all callbacks and othwerwise treat it as an opaque blob of data. + +`callbacks` object will be provided by the balancer builder. This object implements the following interface (written in pseudo-code) + +``` +wrrCallbacks interface { + onSubchannelAdded(subchannelID int, data callbackData) + + onSubchannelRemoved(subchannelID int, data callbackData) + + // onLoadReport is called when a new load report is recieved for a given subchannel. + // This function returns the new weight for a subchannel. If returned value is -1 + // the subchannel should keep using the old value. + // onLoadReport won't be called during blackout period. + onLoadReport(subchannelId int, data callbackData, conf lbConfig, report loadReport) float + + // onEDFSchedulerUpdate is called after wrr balancer recreates the EDF scheduler. + onEDFSchedulerUpdate(data callbackData) +} +``` + +`pid` balancer implements those callbacks as follows: + +``` +func onSubchannelAdded(subchannelID int) { + // do nothing +} + +func onSubchannelRemoved(subchannelID int) { + // remove subchannelID from 2 maps + // that store the value of last utiliztion and + // last applied weight per subchannel + delete(subchannelID, data.utilizationPerSubchannel) + delete(subchannelID,data.lastAppliedWeightPerSubchannel) +} + +func onLoadReport(subchannelId int, data callbackData, conf lbConfig, load loadReport, lastApplied time) float { + utilization = load.ApplicationUtilization + if utilization == 0 { + utilization = load.CpuUtilization + } + if utilization == 0 || load.RpsFractional == 0 { + // ignore empty load + return -1 + } + errorRate = load.Eps / load.RpsFractional + useErrPenalty = errorRate > conf.ErrorUtilizationThreshold + if useErrPenalty { + utilization += errorRate * conf.ErrorUtilizationPenalty + } + + // Make sure at least WeightUpdatePeriod has passed since we last updated PID state. + // If we don't do that PID controller internal state may get corrupted in 2 ways: + // * If 2 updates are very close to each other in time, samplingInterval ~= 0 and signal ~= infinity. + // * If multiple updates happened during a single WeightUpdatePeriod, the actual weights are not applied, + // but PID controller keep growing the weight and it may easily pass the balancing point. 
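+  // (lastApplied, a parameter of this callback, is the time when a weight for this
+  // subchannel was last applied; it is assumed to be tracked by the wrr machinery.)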
+ if time.Since(lastApplied) < conf.WeightUpdatePeriod { + return -1 + } + + // use value calculated in the onEDFSchedulerUpdate method + meanUtilization = data.meanUtilization + + // call PID controlelr to get the value of the control signal. + controlSignal = data.pidController.update({ + referenceSignal: meanUtilization, + actualSignal: utilization, + samplingInterval: time.Since(lastApplied), + }) + + // Normalize the signal. + // If meanUtilization ~= 0 the signal will be ~= 0 as well, and convergence will becoma painfully slow. + // If, meanUtilization >> 1 the signal may become very high, which could lead to oscillations. + if meanUtilization > 0 { + controlSignal *= 1 / meanUtilization + } + + lastAppliedWeight = data.lastAppliedWeightPerSubchannel[subchannelID] + + // Use controlSignal to adjust the weight. + // First caclulate multiplier that will be used to determine how much weight should be changed. + // The higher is the absolute value of the controlSignal the more we need to adjust the weight. + if controlSignal >= 0 { + // in this case mult should belong to [1,inf) interval, so we will be increasing the weight. + mult = 1.0 + controlSignal + } else { + // in this case mult should belong to (0, 1) interval, so we will be decreasing the weight. + mult = -1.0 / (controlSignal - 1.0) + } + weight = lastAppliedWeight * mult + // clamp weight at min/max values to avoid growing to infinity or zero. + if weight > conf.MaxWeight { + weight = conf.MaxWeight + } + if weight < conf.MinWeight { + weight = conf.MinWeight + } + + // save resulting utilization and weight. + data.utilizationPerSubchannel[subchannelId] = utilization + data.lastAppliedWeightPerSubchannel[subchannelID] = weight + + return weight +} + +func onEDFSchedulerUpdate(data callbackData) { + // simply calculate mean utilization for all subchannels + sum = 0 + foreach key, value in data.utilizationPerSubchannel { + sum += value + } + data.meanUtilization = sum / data.utilizationPerSubchannel.length() +} +``` + +### Deling with oscilations + +The main problem with using `pid` balancer is the probability of oscilations. This probability depends on the following factors + +* How fast load reports are propagated to clients. The larger is the delay the higher is the chances that `pid` balancer will keep adjusting weights in the wrong direction, which could lead to oscilations. There are 2 cases we should consider: + * `Direct load reporting`. In this case propagation delay depends on the request frequency and `WeightUpdatePeriod` setting. In practise this results in very fast propagation with default `WeightUpdatePeriod` value (1s) and this is the prefered option when using `pid` + * `OOB load reporting`. In this case users can control the delay by using `OobReportingPeriod` setting. The delay in this case is usually much larger, still it is possible to acieve perfect convergence with OOB reporting on workloads with stable load. +* `ProportionalGain` value. If it is too high `pid` balancer will be making big adjustments to the weights and may pass the balancing point. Default value (0.1) result in relatively fast convergence (usually faster than 30 sec) on non spiky workloads. +* How stable is server load. `pid` balancer don't work very well with servers that have spiky load. The main reason for this is that mean utilization is not stable, which constantly disturb the convergence direction for all subchannels. This is the only property users can't directly control on the clinet side. 
The proposal is to add an "average window" mechanism to deal with it on the server. This will be discussed in the next section. +* How large is the number of subchannels. The larger it is, the more stable mean utilization is, which result in faster convergence and no oscilatioins. This is directly related to the usage of random subsetting discussed in [gRFC A68][A68]. If someone chooses too small subset size `pid` may have hard time converging the load across backends both because mean utilization is unstable and because a lot of clients may get connection only to overloaded or only to underloaded server - such clients won't contribute much to achieving overal convergence. Still we were able to get ok convergence on a spiky workload with ridiculasly small subset size of 4 with 3 minutes moving average window size for load reporting. Proposed default subset size of 20 usually results in good convergence on any workload. + +### Moving awerage window for load reporting + +As mentioned in the prvious section, we need a mechanizm to make utilization more smooth in the server load reports, otherwise `pid` balancer might not be able to acieve convergence on spiky workloads. The proposed solution is to use moving average window when reporting results for a particular backend metrics. We should extend `MetricRecorder` component, which is desribed in [gRFC A51][A51] and add `MovingAverageWindowSize` parameter to it. Intead of storing a single value per metric, `MetricRecorder` now will store `MovingAverageWindowSize` last reported values. Whenever `recordMetricXXX` method is called `MetricRecorder` will add the new value to the circular buffer and remove the oldest value from the same buffer. This is ilustrated in the follwing pseudo-code example + +``` +func recordMetricXXX(value float) { + // make sure the updates are atomic, otherwise we risk getting corrupted cicular buffer + lock.Lock() + // this automatically removes the last added value if cicular buffer is full + circularBufferForMerricXXX.add(value) + + sum = 0 + foreach val in circularBufferForMerricXXX { + sum += val + } + metricXXXvalue = sum/circularBufferForMerricXXX.size() + lock.Unlock() +} +``` + +Setting `MovingAverageWindowSize` is identical to using tu current behaviour, and should be the default. + + + +[A51]: A51-custom-backend-metrics.md +[A58]: A58-client-side-weighted-round-robin-lb-policy.md +[A68]: https://github.com/grpc/proposal/pull/423 \ No newline at end of file From f1a466177988fce09d382f42a8a537afca5c7f63 Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Thu, 25 Apr 2024 12:14:57 -0600 Subject: [PATCH 02/13] chatGPT review --- A80-pid.md | 232 +++++++++++++++++++++++++++++++---------------------- 1 file changed, 137 insertions(+), 95 deletions(-) diff --git a/A80-pid.md b/A80-pid.md index 90d5bd2ee..d400978f2 100644 --- a/A80-pid.md +++ b/A80-pid.md @@ -9,17 +9,18 @@ A68: PID LB policy. ## Abstract -This document proposes a design for a new load balancing policy `pid`. `pid` stands for [Proportional–integral–derivative controller](https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller). This policy is built on top of [A58: weighted_round_robin LB policy (WRR)][A58] and also requires direct load reporting from backends to clients. Like `wrr` it also uses client-side weighted round robin load balancing. 
Unlike `wrr` it doesn't build weights deterministically, instead it uses feedback loop with `pid` controller to adjust the weights in such a way that load on all backends converge to the same value. The policy supports either per-call or periodic out-of-band load reporting per [gRFC A51][A51]. +This document proposes a design for a new load balancing policy called pid. The term pid stands for [Proportional–integral–derivative controller](https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller). This policy builds upon the [A58: weighted_round_robin LB policy (WRR)][A58] and requires direct load reporting from backends to clients. Similar to wrr, it utilizes client-side weighted round robin load balancing. However, unlike wrr, it does not determine weights deterministically. Instead, it employs a feedback loop with the pid controller to adjust the weights in a manner that allows the load on all backends to converge to the same value. The policy supports either per-call or periodic out-of-band load reporting as per [gRFC A51][A51]. ## Background -`wrr` uses the following formula to calculate the subchannel weights: +The `wrr` policy uses the following formula to calculate subchannel weights:: $$weight = \dfrac{qps}{utilization + \dfrac{eps}{qps} * error\\_utilization\\_penalty}$$ -This works great in cases when backends have different average CPU cost per request and recieve identical number of connections. In such cases `wrr` helps to distribute requests between backends fairly, such that more powerfull backends recieve more requests and less powerfull backends recieve less requests. However, `wrr` doesn't help at all to correct the imbalance generated by usage of random subsetting, as described in [gRFC A68][A68] This is because the usage of random subsetting results in a state where some backends recieve more connections than others. The number of connections a server recieves doesn't affect its CPU cost per request metric, so more connected backends will be reciving more requests than less connected backends. +This formula is effective when backends, which have different average CPU costs per request, receive an identical number of connections. In such scenarios, `wrr` aids in fairly distributing requests between backends, ensuring that more powerful backends receive more requests, and less powerful backends receive fewer requests. However, `wrr` is not effective in correcting imbalances generated by the use of random subsetting, as described in [gRFC A68][A68]. This is because random subsetting leads to a situation where some backends receive more connections than others. The number of connections a server receives does not impact its CPU cost per request metric, so more connected backends will end up receiving more requests than less connected ones. + +The `pid` balancer takes a different approach: instead of deterministically calculating weights based on a backend metric, it continuously adjusts weights at runtime. It utilizes a feedback loop based on the backend CPU metric to determine the direction and magnitude of every weight update. -`pid` balancer takes a different aproach: instead of deterministically calculating weights based on some backend metric, it keeps adjusting weights at runtime and uses a feedback loop based on backend CPU metric to determine the direction and maginute for every weight update. 
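+
+For intuition about the `wrr` formula above, take hypothetical backend metrics $qps = 100$, $utilization = 0.5$, $eps = 1$ and an $error\\_utilization\\_penalty$ of $1.0$:
+
+$$weight = \dfrac{100}{0.5 + \dfrac{1}{100} * 1.0} = \dfrac{100}{0.51} \approx 196$$
+
+A backend with the same $qps$ but twice the CPU utilization would get roughly half this weight, while a backend that merely holds more idle connections reports the same per-request cost and keeps the same weight, which is exactly why `wrr` cannot see the imbalance created by subsetting.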
### Related Proposals: * [gRFC A51][A51] @@ -29,7 +30,7 @@ This works great in cases when backends have different average CPU cost per requ ## Proposal -Introduce a new LB policy `pid`. This policy implements client-side load balancing with direct load reports from backends. It uses feedback look with PID controller to tune the weights. The policy is otherwise largely a superset of the existing policy `weighted_round_robin`. +Introduce a new LB policy `pid`. This policy implements client-side load balancing with direct load reports from backends. It utilizes a feedback loop with a PID controller to dynamically adjust the weights. The policy is otherwise largely a superset of the existing policy `weighted_round_robin`. ### LB Policy Config and Parameters @@ -43,52 +44,45 @@ message LoadBalancingConfig { } message PIDLbConfig { - // The config for the WRR load balancer, as defined in [gRFC A58][A58] - // PID balancer is an extenstion of WRR and all WRR settings apply to PID in an identical way. + // Configuration for the WRR load balancer as defined in [gRFC A58][A58]. + // The PID balancer is an extension of WRR and all settings applicable to WRR also apply to PID identically. WeightedRoundRobinLbConfig wrr_config = 1; - // Only after eps/qps grows past this value the balancer starts taking into account ErrorUtilizationPenalty. - // This is necessary to avoid oscilations in cases when server has very high and spiky error rate. - // Even in such cases we don't want to remove error_utilization_penalty completely as then we could - // redirect all traffic to an instance that has low CPU and simply rejects all requests. - // Default is 0.5. + // Threshold beyond which the balancer starts considering the ErrorUtilizationPenalty. + // This helps avoid oscillations in cases where the server experiences a very high and spiky error rate. + // We avoid eliminating the error_utilization_penalty entirely to prevent redirecting all traffic to an instance + // that has low CPU usage but rejects all requests. Default is 0.5. google.protobuf.FloatValue error_utilization_threshold = 2; - // Controls how fast PID controller converges to mean value. Higher values speed up convergence - // but can result in oscillations. Oscillations can happen if server load changes faster than - // PID controller can react to the changes. This could happen if there are significant delays in - // load report propagation (for example, due to use of OOB reporting with large load period) or if - // server load is very spiky. To deal with spiky server load server owners should use moving average - // on the server to smooth load function. - // Default is 0.1. + // Controls the convergence speed of the PID controller. Higher values accelerate convergence but may induce oscillations, + // especially if server load changes more rapidly than the PID controller can adjust. Oscillations might also occur due to + // significant delays in load report propagation or extremely spiky server load. To mitigate spiky loads, server owners should + // employ a moving average to smooth the load reporting. Default is 0.1. google.protobuf.FloatValue proportional_gain = 2; - // Controls how smooth PID controller convergence is. Higher values makes it more smooth, - // but can slow down convergence. + // Adjusts the smoothness of the PID controller convergence. Higher values enhance smoothness but can decelerate convergence. // Default is 1. google.protobuf.FloatValue derivative_gain = 4; - // Max allowed weight. 
If PID controller attempts to set a higher weight it will be capped at this value. - // This is necessary to prevent weights growing to infinity, which could happen if only a subset of clients - // is using PID and we reached a point after which increasing weights no longer help to correct the imbalance. - // The default is 10. + // Maximum allowable weight. Weights proposed by the PID controller exceeding this value will be capped. + // This prevents infinite weight growth, which could occur if only a subset of clients uses PID and increasing weights + // no longer effectively corrects the imbalance. Default is 10. google.protobuf.FloatValue max_weight = 5; - // Min allowed weight. If PID controller attempts to set a lower weight it will be capped at this value. - // This is necessary to prevent weights dropping to zero, which could happen if only a subset of clients - // is using PID and we reached a point after which decreasing weights no longer help to correct the imbalance. - // The default is 0.1 + // Minimum allowable weight. Weights proposed by the PID controller falling below this value will be capped. + // This prevents weights from dropping to zero, which could occur if only a subset of clients uses PID and decreasing weights + // no longer effectively corrects the imbalance. Default is 0.1. google.protobuf.FloatValue min_weight = 6; } ``` ### PID controller -A PID controller is a control loop mechanism employing feedback. It continuously calculates an error value as the difference between a desired setpoint (`referenceSignal`) and a measured process variable (`actualSignal`) and applies a correction based on proportional, integral, and derivative terms (denoted P, I, and D respectively), hence the name. +A PID controller is a control loop feedback mechanism that continuously calculates an error value as the difference between a desired setpoint (`referenceSignal`) and a measured process variable (`actualSignal`). It then applies a correction based on proportional, integral, and derivative terms (denoted P, I, and D respectively), hence the name. -In our implementation we won't be using integral part (It is usefull to speed up convergence when `referenceSignal` changes shraply. In our case we will be conveging the load on the subchannels to the mean value, which is mostly stable) +In our implementation, we will not be using the integral part. The integral component is useful for speeding up convergence when the `referenceSignal` changes sharply. In our case, we will be converging the load on the subchannels to a mean value, which is mostly stable. -Here is a sample implementation in pseudo-code. 
+Here is a sample implementation in pseudo-code: ``` pidController class { @@ -99,76 +93,103 @@ pidController class { update(referenceSignal float, actualSignal float, samplingInterval duration) float { previousError = this.controlError - // save last controlError so we can use it to calculate derivative during next update + // Save last controlError so we can use it to calculate derivative during next update this.controlError = referenceSignal - actualSignal - controlErrorDerivative = (controlError - previousError) / samplingInterval.Seconds() - controlSignal = this.proportionalGain*this.ontrolError + - this.derivativeGain*this.controlErrorDerivative + controlErrorDerivative = (this.controlError - previousError) / samplingInterval.Seconds() + controlSignal = this.proportionalGain * this.controlError + + this.derivativeGain * controlErrorDerivative + return controlSignal } } ``` -`update` method is expected to be called on a regular basis. `samplingInterval` is the duration since the last update. Return value is the control signal which, if applied to the system, should minimize control error. In the next section we'll discuss how control signal is converted to `wrr` weight. +The `update` method is expected to be called on a regular basis, with `samplingInterval` being the duration since the last update. The return value is the control signal which, if applied to the system, should minimize the control error. In the next section, we'll discuss how this control signal is converted to `wrr` weight. + + +Your explanation of the PID controller's implementation and pseudo-code looks good, but there are a few typographical errors and clarifications that could be made to improve readability and accuracy. Here's a revised version of your description and pseudo-code: + +A PID controller is a control loop feedback mechanism that continuously calculates an error value as the difference between a desired setpoint (referenceSignal) and a measured process variable (actualSignal). It then applies a correction based on proportional, integral, and derivative terms (denoted P, I, and D respectively), hence the name. + +In our implementation, we will not be using the integral part. The integral component is useful for speeding up convergence when the referenceSignal changes sharply. In our case, we will be converging the load on the subchannels to a mean value, which is mostly stable. + +Here is a sample implementation in pseudo-code: + +pseudo +Copy code +pidController class { + proportionalGain float + derivativeGain float + + controlError float + + update(referenceSignal float, actualSignal float, samplingInterval duration) float { + previousError = this.controlError + // Save last controlError so we can use it to calculate derivative during next update + this.controlError = referenceSignal - actualSignal + controlErrorDerivative = (this.controlError - previousError) / samplingInterval.Seconds() + controlSignal = this.proportionalGain * this.controlError + + this.derivativeGain * controlErrorDerivative + return controlSignal + } +} +The update method is expected to be called on a regular basis, with samplingInterval being the duration since the last update. The return value is the control signal which, if applied to the system, should minimize the control error. In the next section, we'll discuss how this control signal is converted to wrr weight. -`proportionalGain` and `derivativeGain` parameters are taken from the lb config. `proportionalGain` should be first scaled by the `WeightUpdatePeriod` value. 
This is because `controlErrorDerivative` is inversly proportional to the sampling interval, which in turn is close to `WeightUpdatePeriod` as we will be updating PID state once per `WeightUpdatePeriod`. If `WeightUpdatePeriod` is too small `controlErrorDerivative` becomes too large and dominates the resulting controll error. +The `proportionalGain` and `derivativeGain` parameters are taken from the LB config. `proportionalGain` should be scaled by the `WeightUpdatePeriod` value. This is necessary because `controlErrorDerivative` is inversely proportional to the sampling interval, which in turn is close to the `WeightUpdatePeriod` as we will be updating the PID state once per `WeightUpdatePeriod`. If `WeightUpdatePeriod` is too small, `controlErrorDerivative` becomes too large and dominates the resulting control error. ### Extending WRR balancer -`pid` balancer reuses 90% of `wrr` code. The proposal is to refactor `wrr` and add a few hooks to it so that other balancers (like `pid`) could reuse the code without copy-paste. We should keep those hooks internal (at least at first) to avoid locking ourselves into a new public API. This is mostly language specific, but the idea is to do the following: -* Add `callbacks` object to wrr balancer. This object contains a few callbacks that `wrr` will be calling during various stages of its lifetime. -* Add `callbackData` object, which will be used by callbacks to store any data that is reused between callbacks. The balancer will be passing it all callbacks and othwerwise treat it as an opaque blob of data. +The `pid` balancer reuses 90% of the wrr code. The proposal is to refactor the `wrr` codebase and introduce several hooks that allow other balancers, like `pid`, to reuse the code efficiently without the need for duplication. Initially, these hooks will remain internal to avoid prematurely establishing a new public API. This approach is mostly language-specific, but the general plan is as follows: -`callbacks` object will be provided by the balancer builder. This object implements the following interface (written in pseudo-code) +* Add a `callbacks` object to the `wrr` balancer: This object will contain a series of callback functions that wrr will invoke at various stages of its lifecycle. +* Introduce a `callbackData` object: This will be utilized by the callbacks to store any data that is reused across different callback functions. The `wrr` balancer will pass this object to all callbacks and treat it as an opaque blob of data. + +The `callbacks` object, which is to be provided by the balancer builder, will implement the following interface (expressed in pseudo-code): ``` wrrCallbacks interface { onSubchannelAdded(subchannelID int, data callbackData) - onSubchannelRemoved(subchannelID int, data callbackData) - // onLoadReport is called when a new load report is recieved for a given subchannel. - // This function returns the new weight for a subchannel. If returned value is -1 + // onLoadReport is called when a new load report is received for a given subchannel. + // This function returns the new weight for a subchannel. If the returned value is -1, // the subchannel should keep using the old value. - // onLoadReport won't be called during blackout period. + // onLoadReport won't be called during the blackout period. onLoadReport(subchannelId int, data callbackData, conf lbConfig, report loadReport) float - // onEDFSchedulerUpdate is called after wrr balancer recreates the EDF scheduler. 
+ // onEDFSchedulerUpdate is called after the wrr balancer recreates the EDF scheduler. onEDFSchedulerUpdate(data callbackData) } -``` - -`pid` balancer implements those callbacks as follows: -``` -func onSubchannelAdded(subchannelID int) { - // do nothing +// Implementation for PID balancer +func onSubchannelAdded(subchannelID int, data callbackData) { + // Do nothing } -func onSubchannelRemoved(subchannelID int) { - // remove subchannelID from 2 maps - // that store the value of last utiliztion and - // last applied weight per subchannel - delete(subchannelID, data.utilizationPerSubchannel) - delete(subchannelID,data.lastAppliedWeightPerSubchannel) +func onSubchannelRemoved(subchannelID int, data callbackData) { + // Remove subchannelID from two maps that store the value of last utilization + // and last applied weight per subchannel + delete(data.utilizationPerSubchannel, subchannelID) + delete(data.lastAppliedWeightPerSubchannel, subchannelID) } + func onLoadReport(subchannelId int, data callbackData, conf lbConfig, load loadReport, lastApplied time) float { utilization = load.ApplicationUtilization - if utilization == 0 { - utilization = load.CpuUtilization - } - if utilization == 0 || load.RpsFractional == 0 { - // ignore empty load - return -1 - } + if utilization == 0 { + utilization = load.CpuUtilization + } + if utilization == 0 || load.RpsFractional == 0 { + // Ignore empty load + return -1 + } errorRate = load.Eps / load.RpsFractional - useErrPenalty = errorRate > conf.ErrorUtilizationThreshold - if useErrPenalty { - utilization += errorRate * conf.ErrorUtilizationPenalty - } + useErrPenalty = errorRate > conf.ErrorUtilizationThreshold + if useErrPenalty { + utilization += errorRate * conf.ErrorUtilizationPenalty + } - // Make sure at least WeightUpdatePeriod has passed since we last updated PID state. - // If we don't do that PID controller internal state may get corrupted in 2 ways: + // Ensure at least WeightUpdatePeriod has passed since the last update. + // Prevents corruption of PID controller's internal state, which could happen in the following cases: // * If 2 updates are very close to each other in time, samplingInterval ~= 0 and signal ~= infinity. // * If multiple updates happened during a single WeightUpdatePeriod, the actual weights are not applied, // but PID controller keep growing the weight and it may easily pass the balancing point. @@ -206,7 +227,8 @@ func onLoadReport(subchannelId int, data callbackData, conf lbConfig, load loadR mult = -1.0 / (controlSignal - 1.0) } weight = lastAppliedWeight * mult - // clamp weight at min/max values to avoid growing to infinity or zero. + + // Clamp weight if weight > conf.MaxWeight { weight = conf.MaxWeight } @@ -214,7 +236,7 @@ func onLoadReport(subchannelId int, data callbackData, conf lbConfig, load loadR weight = conf.MinWeight } - // save resulting utilization and weight. + // Save resulting utilization and weight. 
data.utilizationPerSubchannel[subchannelId] = utilization data.lastAppliedWeightPerSubchannel[subchannelID] = weight @@ -222,48 +244,68 @@ func onLoadReport(subchannelId int, data callbackData, conf lbConfig, load loadR } func onEDFSchedulerUpdate(data callbackData) { - // simply calculate mean utilization for all subchannels - sum = 0 - foreach key, value in data.utilizationPerSubchannel { - sum += value + // Calculate mean utilization across all subchannels + totalUtilization = 0 + count = len(data.utilizationPerSubchannel) + for _, utilization in data.utilizationPerSubchannel { + totalUtilization += utilization } - data.meanUtilization = sum / data.utilizationPerSubchannel.length() + data.meanUtilization = totalUtilization / count } ``` -### Deling with oscilations +### Dealing with Oscillations -The main problem with using `pid` balancer is the probability of oscilations. This probability depends on the following factors +One of the main challenges with the pid balancer is the potential for oscillations. Several factors influence this likelihood: -* How fast load reports are propagated to clients. The larger is the delay the higher is the chances that `pid` balancer will keep adjusting weights in the wrong direction, which could lead to oscilations. There are 2 cases we should consider: - * `Direct load reporting`. In this case propagation delay depends on the request frequency and `WeightUpdatePeriod` setting. In practise this results in very fast propagation with default `WeightUpdatePeriod` value (1s) and this is the prefered option when using `pid` - * `OOB load reporting`. In this case users can control the delay by using `OobReportingPeriod` setting. The delay in this case is usually much larger, still it is possible to acieve perfect convergence with OOB reporting on workloads with stable load. -* `ProportionalGain` value. If it is too high `pid` balancer will be making big adjustments to the weights and may pass the balancing point. Default value (0.1) result in relatively fast convergence (usually faster than 30 sec) on non spiky workloads. -* How stable is server load. `pid` balancer don't work very well with servers that have spiky load. The main reason for this is that mean utilization is not stable, which constantly disturb the convergence direction for all subchannels. This is the only property users can't directly control on the clinet side. The proposal is to add an "average window" mechanism to deal with it on the server. This will be discussed in the next section. -* How large is the number of subchannels. The larger it is, the more stable mean utilization is, which result in faster convergence and no oscilatioins. This is directly related to the usage of random subsetting discussed in [gRFC A68][A68]. If someone chooses too small subset size `pid` may have hard time converging the load across backends both because mean utilization is unstable and because a lot of clients may get connection only to overloaded or only to underloaded server - such clients won't contribute much to achieving overal convergence. Still we were able to get ok convergence on a spiky workload with ridiculasly small subset size of 4 with 3 minutes moving average window size for load reporting. Proposed default subset size of 20 usually results in good convergence on any workload. +1. **Propagation Delay of Load Reports:** + * **Direct Load Reporting**: Here, the delay depends on the request frequency and the `WeightUpdatePeriod` setting. 
Typically, with the default `WeightUpdatePeriod` of 1 second, propagation is very fast, making this the preferred option when using the pid balancer. + * **OOB Load Reporting**: Users can control the delay by adjusting the `OobReportingPeriod` setting. While the delay is usually larger compared to direct reporting, achieving perfect convergence with OOB reporting is still possible on workloads with stable loads. +2. **Proportional Gain:** + * A high `ProportionalGain` can lead to significant weight adjustments, potentially overshooting the balancing point. The default value of 0.1 generally allows for fast convergence (typically faster than 30 seconds—on workloads that are not spiky) while not generating oscillations. +3. **Stability of Server Load:** + * The pid balancer struggles with servers that exhibit spiky loads because the mean utilization is not stable, which disrupts the convergence direction for all subchannels. Unfortunately, this is one aspect users cannot directly control from the client side. To address this, the proposal includes implementing an "average window" mechanism on the server, which will be discussed in the next section. +4. **Number of Subchannels:** + * A larger number of subchannels generally stabilizes the mean utilization, leading to faster convergence and reducing the likelihood of oscillations. This is particularly relevant when considering the use of random subsetting, as discussed in [gRFC A68][A68]. A small subset size can hinder the `pid` balancer’s ability to converge load across backends, particularly if mean utilization is unstable and clients connect only to either overloaded or underloaded servers. However, we have achieved acceptable convergence on a spiky workload with a subset size as small as four, using a three-minute moving average window size for load reporting. A proposed default subset size of 20 typically ensures good convergence on any workload. -### Moving awerage window for load reporting +By understanding and addressing these factors, the `pid` balancer can be more effectively tuned to manage load balancing across different environments and usage scenarios, minimizing the risks associated with oscillations. -As mentioned in the prvious section, we need a mechanizm to make utilization more smooth in the server load reports, otherwise `pid` balancer might not be able to acieve convergence on spiky workloads. The proposed solution is to use moving average window when reporting results for a particular backend metrics. We should extend `MetricRecorder` component, which is desribed in [gRFC A51][A51] and add `MovingAverageWindowSize` parameter to it. Intead of storing a single value per metric, `MetricRecorder` now will store `MovingAverageWindowSize` last reported values. Whenever `recordMetricXXX` method is called `MetricRecorder` will add the new value to the circular buffer and remove the oldest value from the same buffer. This is ilustrated in the follwing pseudo-code example +### Moving Average Window for Load Reporting + +As outlined in the previous section, smoothing the utilization measurements in server load reports is essential for the `pid` balancer to achieve convergence on spiky workloads. To address this, we propose integrating a moving average window mechanism into the `MetricRecorder` component, as described in [gRFC A51][A51]. This involves adding a `MovingAverageWindowSize` parameter to the component. 
Instead of storing a single value per metric, `MetricRecorder` will now maintain the last MovingAverageWindowSize reported values in a circular buffer. The process is detailed in the following pseudo-code: ``` func recordMetricXXX(value float) { - // make sure the updates are atomic, otherwise we risk getting corrupted cicular buffer + // Ensure updates are atomic to avoid corruption of the circular buffer lock.Lock() - // this automatically removes the last added value if cicular buffer is full - circularBufferForMerricXXX.add(value) + // Add the new value to the circular buffer, which automatically removes the oldest value if the buffer is full + circularBufferForMetricXXX.add(value) sum = 0 - foreach val in circularBufferForMerricXXX { + // Calculate the average of the values in the circular buffer + foreach val in circularBufferForMetricXXX { sum += val } - metricXXXvalue = sum/circularBufferForMerricXXX.size() + metricXXXvalue = sum / circularBufferForMetricXXX.size() lock.Unlock() } ``` -Setting `MovingAverageWindowSize` is identical to using tu current behaviour, and should be the default. +Setting `MovingAverageWindowSize` to 1 mimics the current behavior and should remain the default setting. +This modification allows for more stable load reporting by averaging fluctuations over the specified window, thus providing the pid balancer with more consistent data to inform weight adjustments. + + +## Rationale +### Alternatives Considered: + +The main driver for this propsal was the need to implement subsetting. We explored the possibility of using deterministic subsetting in https://github.com/grpc/proposal/pull/383 and got push-back on this for the reasons explained [here](https://github.com/grpc/proposal/pull/383#discussion_r1334587561) + +Additionally, we considered the "scaled wrr" approach, which would adjust the imbalance created by random subsetting by multiplying the server utilization by the number of connections a server receives. Feedback on this approach suggested that it might be more beneficial to pursue more generic solutions that focus on achieving load convergence rather than attempting to tailor the `wrr` method specifically to fit subsetting use cases. + +This feedback led us to explore broader, more adaptable strategies that could better address the complexities introduced by subsetting, culminating in the current proposal. +## Implementation +DataDog will provide Go and Java implementations. [A51]: A51-custom-backend-metrics.md From 7e14d7239a07e03fa16fe4a8162b337c43c6ec67 Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Thu, 25 Apr 2024 12:21:29 -0600 Subject: [PATCH 03/13] remove unused proposal --- A80-pid.md | 1 - 1 file changed, 1 deletion(-) diff --git a/A80-pid.md b/A80-pid.md index d400978f2..1b3463a8f 100644 --- a/A80-pid.md +++ b/A80-pid.md @@ -24,7 +24,6 @@ The `pid` balancer takes a different approach: instead of deterministically calc ### Related Proposals: * [gRFC A51][A51] -* [gRFC A52][A52] * [gRFC A58][A58] * [gRFC A68][A68] From ff1e03774b3977214ffde67ec2554fce4b7c91ed Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Thu, 25 Apr 2024 20:37:00 -0600 Subject: [PATCH 04/13] review comments --- A80-pid.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A80-pid.md b/A80-pid.md index 1b3463a8f..d03dbd02a 100644 --- a/A80-pid.md +++ b/A80-pid.md @@ -13,7 +13,7 @@ This document proposes a design for a new load balancing policy called pid. 
The ## Background -The `wrr` policy uses the following formula to calculate subchannel weights:: +The `wrr` policy uses the following formula to calculate subchannel weights, which is desribed in more details in the "Subchannel Weights" section of [gRFC A58][A58]: $$weight = \dfrac{qps}{utilization + \dfrac{eps}{qps} * error\\_utilization\\_penalty}$$ From 684ed20dfd312a8fb4b26d54ca1890c7097e9281 Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Thu, 25 Apr 2024 20:38:20 -0600 Subject: [PATCH 05/13] review comments --- A80-pid.md | 29 ----------------------------- 1 file changed, 29 deletions(-) diff --git a/A80-pid.md b/A80-pid.md index d03dbd02a..d03a76aff 100644 --- a/A80-pid.md +++ b/A80-pid.md @@ -104,35 +104,6 @@ pidController class { The `update` method is expected to be called on a regular basis, with `samplingInterval` being the duration since the last update. The return value is the control signal which, if applied to the system, should minimize the control error. In the next section, we'll discuss how this control signal is converted to `wrr` weight. - -Your explanation of the PID controller's implementation and pseudo-code looks good, but there are a few typographical errors and clarifications that could be made to improve readability and accuracy. Here's a revised version of your description and pseudo-code: - -A PID controller is a control loop feedback mechanism that continuously calculates an error value as the difference between a desired setpoint (referenceSignal) and a measured process variable (actualSignal). It then applies a correction based on proportional, integral, and derivative terms (denoted P, I, and D respectively), hence the name. - -In our implementation, we will not be using the integral part. The integral component is useful for speeding up convergence when the referenceSignal changes sharply. In our case, we will be converging the load on the subchannels to a mean value, which is mostly stable. - -Here is a sample implementation in pseudo-code: - -pseudo -Copy code -pidController class { - proportionalGain float - derivativeGain float - - controlError float - - update(referenceSignal float, actualSignal float, samplingInterval duration) float { - previousError = this.controlError - // Save last controlError so we can use it to calculate derivative during next update - this.controlError = referenceSignal - actualSignal - controlErrorDerivative = (this.controlError - previousError) / samplingInterval.Seconds() - controlSignal = this.proportionalGain * this.controlError + - this.derivativeGain * controlErrorDerivative - return controlSignal - } -} -The update method is expected to be called on a regular basis, with samplingInterval being the duration since the last update. The return value is the control signal which, if applied to the system, should minimize the control error. In the next section, we'll discuss how this control signal is converted to wrr weight. - The `proportionalGain` and `derivativeGain` parameters are taken from the LB config. `proportionalGain` should be scaled by the `WeightUpdatePeriod` value. This is necessary because `controlErrorDerivative` is inversely proportional to the sampling interval, which in turn is close to the `WeightUpdatePeriod` as we will be updating the PID state once per `WeightUpdatePeriod`. If `WeightUpdatePeriod` is too small, `controlErrorDerivative` becomes too large and dominates the resulting control error. 
### Extending WRR balancer From 530beba0643a9c7dc5d9de9b0744fd67d85bb8b5 Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Thu, 25 Apr 2024 20:43:36 -0600 Subject: [PATCH 06/13] split section --- A80-pid.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/A80-pid.md b/A80-pid.md index d03a76aff..e84872f35 100644 --- a/A80-pid.md +++ b/A80-pid.md @@ -129,8 +129,9 @@ wrrCallbacks interface { // onEDFSchedulerUpdate is called after the wrr balancer recreates the EDF scheduler. onEDFSchedulerUpdate(data callbackData) } - -// Implementation for PID balancer +``` +Here is how `pid` balancer implements this interface. +``` func onSubchannelAdded(subchannelID int, data callbackData) { // Do nothing } From 0d6f76e16b400a6e7baf91d8c7ec795004a0a38b Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Fri, 26 Apr 2024 12:56:53 -0600 Subject: [PATCH 07/13] public wrrCallbacks interface --- A80-pid.md | 49 ++++++++++++++++++++++++++++--------------------- 1 file changed, 28 insertions(+), 21 deletions(-) diff --git a/A80-pid.md b/A80-pid.md index e84872f35..60984769d 100644 --- a/A80-pid.md +++ b/A80-pid.md @@ -13,7 +13,7 @@ This document proposes a design for a new load balancing policy called pid. The ## Background -The `wrr` policy uses the following formula to calculate subchannel weights, which is desribed in more details in the "Subchannel Weights" section of [gRFC A58][A58]: +The `wrr` policy uses the following formula to calculate subchannel weights, which is described in more details in the "Subchannel Weights" section of [gRFC A58][A58]: $$weight = \dfrac{qps}{utilization + \dfrac{eps}{qps} * error\\_utilization\\_penalty}$$ @@ -23,9 +23,9 @@ The `pid` balancer takes a different approach: instead of deterministically calc ### Related Proposals: -* [gRFC A51][A51] -* [gRFC A58][A58] -* [gRFC A68][A68] +* [gRFC A51: Custom Backend Metrics Support][A51] +* [gRFC A58: `weighted_round_robin` LB policy][A58] +* [gRFC A68: Random subsetting with rendezvous hashing LB policy.][A68] ## Proposal @@ -104,39 +104,41 @@ pidController class { The `update` method is expected to be called on a regular basis, with `samplingInterval` being the duration since the last update. The return value is the control signal which, if applied to the system, should minimize the control error. In the next section, we'll discuss how this control signal is converted to `wrr` weight. -The `proportionalGain` and `derivativeGain` parameters are taken from the LB config. `proportionalGain` should be scaled by the `WeightUpdatePeriod` value. This is necessary because `controlErrorDerivative` is inversely proportional to the sampling interval, which in turn is close to the `WeightUpdatePeriod` as we will be updating the PID state once per `WeightUpdatePeriod`. If `WeightUpdatePeriod` is too small, `controlErrorDerivative` becomes too large and dominates the resulting control error. +The `proportionalGain` and `derivativeGain` parameters are taken from the LB config. `proportionalGain` should be additionally scaled by the `WeightUpdatePeriod` value. This is necessary because derivative error is calculated like `controlErrorDerivative = (this.controlError - previousError) / samplingInterval.Seconds()` and dividing by a very small `samplingInterval` value makes the result too big. `WeightUpdatePeriod` is roughly equal to `samplingInterval` as we will be updating the PID state once per `WeightUpdatePeriod`. ### Extending WRR balancer -The `pid` balancer reuses 90% of the wrr code. 
The proposal is to refactor the `wrr` codebase and introduce several hooks that allow other balancers, like `pid`, to reuse the code efficiently without the need for duplication. Initially, these hooks will remain internal to avoid prematurely establishing a new public API. This approach is mostly language-specific, but the general plan is as follows:
+The `pid` balancer reuses 90% of the wrr code. The proposal is to refactor the `wrr` codebase and introduce several hooks that allow other balancers, like `pid`, to reuse the code efficiently without the need for duplication. This approach is mostly language-specific, but the general plan is as follows:
 
 * Add a `callbacks` object to the `wrr` balancer: This object will contain a series of callback functions that wrr will invoke at various stages of its lifecycle.
 * Introduce a `callbackData` object: This will be utilized by the callbacks to store any data that is reused across different callback functions. The `wrr` balancer will pass this object to all callbacks and treat it as an opaque blob of data.
+* Add a `callbackConfig` object to the `wrr` balancer: This object will contain the PID-specific part of the user-provided config, as defined in the `LB Policy Config and Parameters` section. The `wrr` balancer will pass this object to all callbacks and treat it as an opaque blob of data.
 
 The `callbacks` object, which is to be provided by the balancer builder, will implement the following interface (expressed in pseudo-code):
 
 ```
 wrrCallbacks interface {
-  onSubchannelAdded(subchannelID int, data callbackData)
-  onSubchannelRemoved(subchannelID int, data callbackData)
+  onSubchannelAdded(subchannelID int, data callbackData, conf callbackConfig)
+  onSubchannelRemoved(subchannelID int, data callbackData, conf callbackConfig)
 
   // onLoadReport is called when a new load report is received for a given subchannel.
   // This function returns the new weight for a subchannel. If the returned value is -1,
   // the subchannel should keep using the old value.
+  // lastApplied is the time when a weight returned by onLoadReport was last applied to this subchannel.
   // onLoadReport won't be called during the blackout period.
-  onLoadReport(subchannelId int, data callbackData, conf lbConfig, report loadReport) float
+  onLoadReport(subchannelID int, report loadReport, lastApplied time, data callbackData, conf callbackConfig) float
 
-  // onEDFSchedulerUpdate is called after the wrr balancer recreates the EDF scheduler.
-  onEDFSchedulerUpdate(data callbackData)
+  // onEDFSchedulerUpdate is called after the wrr balancer recreates the EDF scheduler.
+  onEDFSchedulerUpdate(data callbackData, conf callbackConfig)
 }
 ```
-Here is how `pid` balancer implements this interface.
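+A sketch of what `callbackData` could hold for the `pid` balancer (a hypothetical shape, matching the pseudo-code below; note the pseudo-code accesses a single `data.pidController`, but since `controlError` is per-subchannel state, an implementation would likely keep one controller per subchannel):
+
+```
+pidCallbackData struct {
+  // Last smoothed utilization reported by each subchannel.
+  utilizationPerSubchannel map[int]float
+  // Last weight this policy returned for each subchannel.
+  lastAppliedWeightPerSubchannel map[int]float
+  // Mean utilization across subchannels, refreshed in onEDFSchedulerUpdate.
+  meanUtilization float
+  // PID controller state, one per subchannel.
+  pidControllerPerSubchannel map[int]pidController
+}
+```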
```
-func onSubchannelAdded(subchannelID int, data callbackData) {
+func onSubchannelAdded(subchannelID int, data callbackData, conf callbackConfig) {
   // Do nothing
 }

-func onSubchannelRemoved(subchannelID int, data callbackData) {
+func onSubchannelRemoved(subchannelID int, data callbackData, conf callbackConfig) {
   // Remove subchannelID from the two maps that store the last utilization
   // and the last applied weight per subchannel
   delete(data.utilizationPerSubchannel, subchannelID)
@@ -144,7 +146,7 @@ func onSubchannelRemoved(subchannelID int, data callbackData) {
 }

-func onLoadReport(subchannelId int, data callbackData, conf lbConfig, load loadReport, lastApplied time) float {
+func onLoadReport(subchannelId int, load loadReport, data callbackData, conf callbackConfig) float {
   utilization = load.ApplicationUtilization
   if utilization == 0 {
     utilization = load.CpuUtilization
@@ -163,7 +165,7 @@ func onLoadReport(subchannelId int, data callbackData, conf lbConfig, load loadR
   // Prevents corruption of PID controller's internal state, which could happen in the following cases:
   // * If 2 updates are very close to each other in time, samplingInterval ~= 0 and signal ~= infinity.
   // * If multiple updates happened during a single WeightUpdatePeriod, the actual weights are not applied,
-  // but PID controller keep growing the weight and it may easily pass the balancing point.
+  // but the PID controller keeps growing the weights and it may easily pass the balancing point.
  if time.Since(lastApplied) < conf.WeightUpdatePeriod {
     return -1
   }
@@ -171,7 +173,7 @@
   // use value calculated in the onEDFSchedulerUpdate method
   meanUtilization = data.meanUtilization

-  // call PID controlelr to get the value of the control signal.
+  // call the PID controller to get the value of the control signal.
   controlSignal = data.pidController.update({
     referenceSignal: meanUtilization,
     actualSignal: utilization,
@@ -179,7 +181,7 @@
   })

   // Normalize the signal.
-  // If meanUtilization ~= 0 the signal will be ~= 0 as well, and convergence will becoma painfully slow.
+  // If meanUtilization ~= 0 the signal will be ~= 0 as well, and convergence will become painfully slow.
   // If, meanUtilization >> 1 the signal may become very high, which could lead to oscillations.
   if meanUtilization > 0 {
     controlSignal *= 1 / meanUtilization
@@ -188,10 +190,10 @@
   lastAppliedWeight = data.lastAppliedWeightPerSubchannel[subchannelID]

   // Use controlSignal to adjust the weight.
-  // First caclulate multiplier that will be used to determine how much weight should be changed.
+  // First calculate a multiplier that will be used to determine how much weight should be changed.
   // The higher the absolute value of the controlSignal, the more we need to adjust the weight.
   if controlSignal >= 0 {
-    // in this case mult should belong to [1,inf) interval, so we will be increasing the weight.
+    // in this case mult should belong to the [1,inf) interval, so we will be increasing the weight.
     mult = 1.0 + controlSignal
   } else {
     // in this case mult should belong to the (0, 1) interval, so we will be decreasing the weight.
     mult = -1.0 / (controlSignal - 1.0)
   }
   weight = lastAppliedWeight * mult
@@ -225,6 +227,11 @@ func onEDFSchedulerUpdate(data callbackData) {
 }
 ```

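+To make the weight-update rule concrete, here is a worked example with purely illustrative numbers (not part of the proposal itself):
+
+```
+// Suppose lastAppliedWeight = 2.0 and the normalized controlSignal is 0.25
+// (the subchannel is below the mean load):
+mult = 1.0 + 0.25            // = 1.25
+weight = 2.0 * 1.25          // = 2.5, so the subchannel receives more traffic
+// With controlSignal = -0.25 (the subchannel is above the mean load):
+mult = -1.0 / (-0.25 - 1.0)  // = 0.8
+weight = 2.0 * 0.8           // = 1.6, so the subchannel receives less traffic
+// The two branches are symmetric: (1.0 + s) * (-1.0 / (-s - 1.0)) == 1,
+// so equal and opposite signals cancel out instead of drifting.
+```
+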
+The proposal is to make `wrrCallbacks` public. Even though this introduces new public API there are a few reasons that can justify this decision:
+* The interface is concise and generic.
+* Besides PID there are other cases when people need to fork `wrr`. Spotify [uses](https://www.youtube.com/watch?v=8E5zVdEfwi0) ORCA based custom gRPC load balancer to reduce cross-zone traffic. We are also considering incorporating things like latency and cross-az penalty in our load balancing decisions. With the proposed `wrrCallbacks` interface use-cases like this can be covered easily, as users have full control over LB weights. At the same time users don't have to write their own EDF scheduler and handle details related to subchannel management and interactions with resolvers.
+* Existing ORCA extensibility points don't cover such use-cases. We can have custom utilization metric, but what we need is the ability to combine server metrics with the client-side view to generate the resulting weight.
+
 ### Dealing with Oscillations

 One of the main challenges with the `pid` balancer is the potential for oscillations. Several factors influence this likelihood:

 * **Direct Load Reporting**: Here, the delay depends on the request frequency and the `WeightUpdatePeriod` setting. Typically, with the default `WeightUpdatePeriod` of 1 second, propagation is very fast, making this the preferred option when using the `pid` balancer.
 * **OOB Load Reporting**: Users can control the delay by adjusting the `OobReportingPeriod` setting. While the delay is usually larger compared to direct reporting, achieving perfect convergence with OOB reporting is still possible on workloads with stable loads.
 2. **Proportional Gain:**
-   * A high `ProportionalGain` can lead to significant weight adjustments, potentially overshooting the balancing point. The default value of 0.1 generally allows for fast convergence (typically faster than 30 seconds—on workloads that are not spiky) while not generating oscillations.
+   * A high `ProportionalGain` can lead to significant weight adjustments, potentially overshooting the balancing point. The default value of 0.1 generally allows for fast convergence (typically faster than 30 seconds on workloads that are not spiky) while not generating oscillations.
 3. **Stability of Server Load:**
    * The `pid` balancer struggles with servers that exhibit spiky loads because the mean utilization is not stable, which disrupts the convergence direction for all subchannels. Unfortunately, this is one aspect users cannot directly control from the client side. To address this, the proposal includes implementing an "average window" mechanism on the server, which will be discussed in the next section.
 4. **Number of Subchannels:**
@@ -269,7 +276,7 @@ This modification allows for more stable load reporting by averaging fluctuation
 ## Rationale

 ### Alternatives Considered:
-The main driver for this propsal was the need to implement subsetting. We explored the possibility of using deterministic subsetting in https://github.com/grpc/proposal/pull/383 and got push-back on this for the reasons explained [here](https://github.com/grpc/proposal/pull/383#discussion_r1334587561)
+The main driver for this proposal was the need to implement subsetting. 
We explored the possibility of using deterministic subsetting in https://github.com/grpc/proposal/pull/383 and got push-back on this for the reasons explained [here](https://github.com/grpc/proposal/pull/383#discussion_r1334587561).

 Additionally, we considered the "scaled wrr" approach, which would adjust the imbalance created by random subsetting by multiplying the server utilization by the number of connections a server receives. Feedback on this approach suggested that it might be more beneficial to pursue more generic solutions that focus on achieving load convergence rather than attempting to tailor the `wrr` method specifically to fit subsetting use cases.

From 71cb41df309d2f0595e16a5f2bd4c7ccd2e41245 Mon Sep 17 00:00:00 2001
From: Sergey Matyukevich
Date: Fri, 26 Apr 2024 13:35:03 -0600
Subject: [PATCH 08/13] change moving average pseudocode

---
 A80-pid.md | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/A80-pid.md b/A80-pid.md
index 60984769d..ca4e5fbc3 100644
--- a/A80-pid.md
+++ b/A80-pid.md
@@ -227,10 +227,7 @@ func onEDFSchedulerUpdate(data callbackData) {
 }
 ```

-The proposal is to make `wrrCallbacks` public. Even though this introduces new public API there are a few reasons that can justify this decision:
-* The interface is concise and generic.
-* Besides PID there are other cases when people need to fork `wrr`. Spotify [uses](https://www.youtube.com/watch?v=8E5zVdEfwi0) ORCA based custom gRPC load balancer to reduce cross-zone traffic. We are also considering incorporating things like latency and cross-az penalty in our load balancing decisions. With the proposed `wrrCallbacks` interface use-cases like this can be covered easily, as users have full control over LB weights. At the same time users don't have to write their own EDF scheduler and handle details related to subchannel management and interactions with resolvers.
-* Existing ORCA extensibility points don't cover such use-cases. We can have custom utilization metric, but what we need is the ability to combine server metrics with the client-side view to generate the resulting weight.
+The proposal is to make `wrrCallbacks` public. This has a number of significant benefits. Besides PID, there are other cases where one might need to extend `wrr`. For example, Spotify [demonstrates](https://www.youtube.com/watch?v=8E5zVdEfwi0) a gRPC load balancer that reduces cross-zone traffic; this can be implemented nicely in terms of `wrr` weights. We are considering doing the same, as well as incorporating signals such as latency into our load balancing decisions. Existing ORCA extension points don't cover these use cases: we leverage ORCA for custom server utilization metrics, but we also need the ability to combine server and client metrics to generate the resulting weight. The alternative is to write our own balancer with a custom EDF scheduler and to handle the details related to subchannel management and interactions with resolvers. With this new API, use cases like these can be covered naturally, since users have full control over the end-to-end definition of the weights.
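+
+As a sketch of what this unlocks (all names below are hypothetical and shown only for illustration, in the same pseudo-code style as the rest of this document), a zone-aware balancer could be expressed as a `wrrCallbacks` implementation:
+
+```
+// Assumes the builder is constructed with the client's local zone and that
+// a per-subchannel zone map is populated in onSubchannelAdded; neither is
+// part of the proposed API.
+crossZoneCallbacks implements wrrCallbacks {
+  onLoadReport(subchannelId int, report loadReport, data callbackData, conf callbackConfig) float {
+    // Server-side view: a simple load-based weight, in the same spirit as wrr.
+    weight = report.Qps / report.ApplicationUtilization
+    // Client-side view: penalize subchannels located in other zones.
+    if data.zonePerSubchannel[subchannelId] != conf.localZone {
+      weight *= conf.crossZonePenalty // e.g. 0.5
+    }
+    return weight
+  }
+  // onSubchannelAdded, onSubchannelRemoved, and onEDFSchedulerUpdate are elided.
+}
+```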
 ### Dealing with Oscillations

@@ -256,15 +253,17 @@ As outlined in the previous section, smoothing the utilization measurements in s

 func recordMetricXXX(value float) {
   // Ensure updates are atomic to avoid corruption of the circular buffer
   lock.Lock()
+
+  if circularBufferForMetricXXX.isFull() {
+    // The incoming value will evict the oldest entry from the full buffer,
+    // so remove that entry from the running sum first.
+    sum -= circularBufferForMetricXXX.oldest()
+  }
+  sum += value
+
   // Add the new value to the circular buffer, which automatically removes the oldest value if the buffer is full
   circularBufferForMetricXXX.add(value)

-  sum = 0
   // Calculate the average of the values in the circular buffer
-  foreach val in circularBufferForMetricXXX {
-    sum += val
-  }
-  metricXXXvalue = sum / circularBufferForMetricXXX.size()
+  metricXXXvalue = sum / circularBufferForMetricXXX.size()
   lock.Unlock()
 }

From 1786b29a40491ee69e010a922cfddf7cfd3edf4e Mon Sep 17 00:00:00 2001
From: Sergey Matyukevich
Date: Fri, 26 Apr 2024 13:39:35 -0600
Subject: [PATCH 09/13] fix formatting

---
 A80-pid.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/A80-pid.md b/A80-pid.md
index ca4e5fbc3..4297d0002 100644
--- a/A80-pid.md
+++ b/A80-pid.md
@@ -163,8 +163,8 @@ func onLoadReport(subchannelId int, load loadReport, data callbackData, conf cal

   // Ensure at least WeightUpdatePeriod has passed since the last update.
   // Prevents corruption of PID controller's internal state, which could happen in the following cases:
-  // * If 2 updates are very close to each other in time, samplingInterval ~= 0 and signal ~= infinity.
-  // * If multiple updates happened during a single WeightUpdatePeriod, the actual weights are not applied,
+  // - If 2 updates are very close to each other in time, samplingInterval ~= 0 and signal ~= infinity.
+  // - If multiple updates happened during a single WeightUpdatePeriod, the actual weights are not applied,
   //   but the PID controller keeps growing the weights and it may easily pass the balancing point.
@@ -258,7 +258,7 @@
   sum -= circularBufferForMetricXXX.oldest()
   }
   sum += value
-
+
   // Add the new value to the circular buffer, which automatically removes the oldest value if the buffer is full
   circularBufferForMetricXXX.add(value)

From f1d8921a6b4dd18f89591322647c2619a8fe6259 Mon Sep 17 00:00:00 2001
From: Sergey Matyukevich
Date: Fri, 26 Apr 2024 13:40:14 -0600
Subject: [PATCH 10/13] fix formatting

---
 A80-pid.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/A80-pid.md b/A80-pid.md
index 4297d0002..b2e5e1c44 100644
--- a/A80-pid.md
+++ b/A80-pid.md
@@ -163,8 +163,8 @@ func onLoadReport(subchannelId int, load loadReport, data callbackData, conf cal

   // Ensure at least WeightUpdatePeriod has passed since the last update.
   // Prevents corruption of PID controller's internal state, which could happen in the following cases:
-  // - If 2 updates are very close to each other in time, samplingInterval ~= 0 and signal ~= infinity.
-  // - If multiple updates happened during a single WeightUpdatePeriod, the actual weights are not applied,
+  // 1. If 2 updates are very close to each other in time, samplingInterval ~= 0 and signal ~= infinity.
+  // 2. If multiple updates happened during a single WeightUpdatePeriod, the actual weights are not applied,
   //    but the PID controller keeps growing the weights and it may easily pass the balancing point.
   if time.Since(lastApplied) < conf.WeightUpdatePeriod {
     return -1

From 12110b610711bc4d306d628bcd5e424de222ec21 Mon Sep 17 00:00:00 2001
From: Sergey Matyukevich
Date: Fri, 26 Apr 2024 13:40:51 -0600
Subject: [PATCH 11/13] fix formatting

---
 A80-pid.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/A80-pid.md b/A80-pid.md
index b2e5e1c44..33f6bd081 100644
--- a/A80-pid.md
+++ b/A80-pid.md
@@ -163,8 +163,8 @@ func onLoadReport(subchannelId int, load loadReport, data callbackData, conf cal

   // Ensure at least WeightUpdatePeriod has passed since the last update.
   // Prevents corruption of PID controller's internal state, which could happen in the following cases:
-  // 1. If 2 updates are very close to each other in time, samplingInterval ~= 0 and signal ~= infinity.
-  // 2. If multiple updates happened during a single WeightUpdatePeriod, the actual weights are not applied,
+  // If 2 updates are very close to each other in time, samplingInterval ~= 0 and signal ~= infinity.
+  // If multiple updates happened during a single WeightUpdatePeriod, the actual weights are not applied,
   //   but the PID controller keeps growing the weights and it may easily pass the balancing point.
   if time.Since(lastApplied) < conf.WeightUpdatePeriod {
     return -1

From dbc7adcba897527fd73e61eba1e9d30437a9531b Mon Sep 17 00:00:00 2001
From: Sergey Matyukevich
Date: Fri, 26 Apr 2024 13:42:43 -0600
Subject: [PATCH 12/13] fix formatting

---
 A80-pid.md | 64 +++++++++++++++++++++++++++---------------------------
 1 file changed, 32 insertions(+), 32 deletions(-)

diff --git a/A80-pid.md b/A80-pid.md
index 33f6bd081..8845df6e1 100644
--- a/A80-pid.md
+++ b/A80-pid.md
@@ -163,54 +163,54 @@

   // Ensure at least WeightUpdatePeriod has passed since the last update.
   // Prevents corruption of PID controller's internal state, which could happen in the following cases:
-  // If 2 updates are very close to each other in time, samplingInterval ~= 0 and signal ~= infinity.
-  // If multiple updates happened during a single WeightUpdatePeriod, the actual weights are not applied,
-  // but the PID controller keeps growing the weights and it may easily pass the balancing point.
-  if time.Since(lastApplied) < conf.WeightUpdatePeriod {
-    return -1
-  }
+  // * If 2 updates are very close to each other in time, samplingInterval ~= 0 and signal ~= infinity.
+  // * If multiple updates happened during a single WeightUpdatePeriod, the actual weights are not applied,
+  //   but the PID controller keeps growing the weights and it may easily pass the balancing point.
+  if time.Since(lastApplied) < conf.WeightUpdatePeriod {
+    return -1
+  }

   // use value calculated in the onEDFSchedulerUpdate method
-  meanUtilization = data.meanUtilization
+  meanUtilization = data.meanUtilization

   // call the PID controller to get the value of the control signal.
-  controlSignal = data.pidController.update({
-    referenceSignal: meanUtilization,
-    actualSignal: utilization,
-    samplingInterval: time.Since(lastApplied),
-  })
-
-  // Normalize the signal.
-  // If meanUtilization ~= 0 the signal will be ~= 0 as well, and convergence will become painfully slow.
-  // If, meanUtilization >> 1 the signal may become very high, which could lead to oscillations.
+  controlSignal = data.pidController.update({
+    referenceSignal: meanUtilization,
+    actualSignal: utilization,
+    samplingInterval: time.Since(lastApplied),
+  })
+
+  // Normalize the signal.
+  // If meanUtilization ~= 0 the signal will be ~= 0 as well, and convergence will become painfully slow.
+  // If meanUtilization >> 1 the signal may become very high, which could lead to oscillations.
-  if meanUtilization > 0 {
-    controlSignal *= 1 / meanUtilization
-  }
+  if meanUtilization > 0 {
+    controlSignal *= 1 / meanUtilization
+  }

   lastAppliedWeight = data.lastAppliedWeightPerSubchannel[subchannelID]

   // Use controlSignal to adjust the weight.
   // First calculate a multiplier that will be used to determine how much weight should be changed.
   // The higher the absolute value of the controlSignal, the more we need to adjust the weight.
-  if controlSignal >= 0 {
+  if controlSignal >= 0 {
     // in this case mult should belong to the [1,inf) interval, so we will be increasing the weight.
-    mult = 1.0 + controlSignal
-  } else {
+    mult = 1.0 + controlSignal
+  } else {
     // in this case mult should belong to the (0, 1) interval, so we will be decreasing the weight.
-    mult = -1.0 / (controlSignal - 1.0)
-  }
-  weight = lastAppliedWeight * mult
+    mult = -1.0 / (controlSignal - 1.0)
+  }
+  weight = lastAppliedWeight * mult

   // Clamp weight
-  if weight > conf.MaxWeight {
-    weight = conf.MaxWeight
-  }
-  if weight < conf.MinWeight {
-    weight = conf.MinWeight
-  }
+  if weight > conf.MaxWeight {
+    weight = conf.MaxWeight
+  }
+  if weight < conf.MinWeight {
+    weight = conf.MinWeight
+  }

   // Save resulting utilization and weight.
-  data.utilizationPerSubchannel[subchannelId] = utilization
+  data.utilizationPerSubchannel[subchannelId] = utilization
   data.lastAppliedWeightPerSubchannel[subchannelID] = weight

   return weight

From a1fe191ecb97940c886427da46546ab149fd3cee Mon Sep 17 00:00:00 2001
From: Sergey Matyukevich
Date: Fri, 26 Apr 2024 13:59:01 -0600
Subject: [PATCH 13/13] add link to the discussion

---
 A80-pid.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/A80-pid.md b/A80-pid.md
index 8845df6e1..adef81b2a 100644
--- a/A80-pid.md
+++ b/A80-pid.md
@@ -4,8 +4,8 @@ A68: PID LB policy.
 * Approver:
 * Status: Draft
 * Implemented in: PoC in Go
-* Last updated: 2024-04-18
-* Discussion at:
+* Last updated: 2024-04-26
+* Discussion at: https://groups.google.com/g/grpc-io/c/eD2bE2JzQ2w

 ## Abstract