| title | authors | reviewers | approvers | creation-date |
|---|---|---|---|---|
| Federated HPA | | | | 2023-02-14 |
Horizontal Pod Autoscaling (HPA) is a popular method for stabilizing applications by adjusting the number of pods based on different metrics. However, in a single cluster, scaling failure recovery cannot be handled. Moreover, in the era of multi-cluster, single-cluster HPA may not meet growing requirements such as scaling across clusters, unified HPA configuration management, scaling failure recovery, and limiting resource thresholds across clusters.
To address these issues, this proposal introduces FederatedHPA to Karmada. This solution minimizes the user experience discrepancies between HPA in a single cluster and in multi-cluster environments. With FederatedHPA, users can easily scale their applications across multiple clusters and improve the overall efficiency of their Kubernetes clusters.
In the current era of single clusters, Horizontal Pod Autoscaling (HPA) is commonly used to dynamically scale workload replicas to handle incoming requests and enhance resource utilization. Therefore, in the era of multi-cluster, HPA should still work and be capable of scaling workloads in different member clusters, to overcome the resource limitations of a single cluster and improve scaling failure recovery capabilities.
By introducing FederatedHPA to Karmada, users can enable various scenarios with rich policies. For instance, workloads in high-priority clusters may need to be scaled before workloads in low-priority clusters. FederatedHPA provides greater flexibility and allows more precise control over workload scaling across multiple clusters.
- Define the FederatedHPA API to implement unified configuration management for member clusters' HPA.
- Support both Kubernetes HPA (CPU/memory) and customized HPA (custom metrics).
- Tolerate failures of member clusters or the Karmada control plane.
- Minimize control plane overhead when scaling workloads across clusters, and maximize the utilization of member clusters' scaling capabilities.
- Allow fine-grained control over the scaling of workloads across multiple clusters.
- Take the cluster-autoscaler into account: scaling across clusters can be delayed for a configurable period.
As an administrator or platform engineer, I would like to have control over whether to federate existing HPA (Horizontal Pod Autoscaler) resources after a single cluster joins a federation. If not, existing workloads in the single cluster should still be able to scale normally with HPA, unaffected by the federation. If yes, the federation process needs to be smooth enough to achieve elastic use of resources across multiple clusters and unified management of HPA in the member clusters.
As an administrator or platform engineer, I hope to leverage HPA to ensure workload elasticity in a multi-cluster service scenario. I also hope to be able to configure member clusters' HPA based on the clusters' status, achieving fine-grained scaling control. The following are some scenarios:
- If I have two clusters, A and B, with A having more resources and B having fewer, I would like to configure more instances to be scaled up in cluster A to improve overall resource utilization across multiple clusters.
- I have services deployed in both local clusters and cloud clusters. As the service load increases, I hope to prioritize scaling up the service deployed in the local clusters. When the resources of the local clusters are insufficient, I hope to scale up the workloads deployed in the cloud clusters. As the service load decreases, I hope to prioritize scaling down the service deployed in the cloud clusters to reduce cloud costs.
As an administrator or platform engineer, I can ensure service elasticity with the help of Federated HPA in a multi-cluster service scenario. Additionally, I hope to achieve scaling disaster recovery by migrating the HPA and services of failed clusters, which improves service continuity in the event of a cluster failure.
As an administrator or platform engineer, I can ensure service elasticity with the help of Federated HPA in a multi-cluster service scenario. Additionally, I hope to be able to configure an upper limit on the number of instances for multi-cluster applications, thereby limiting the resource consumption of the application and avoiding exceeding the expected multi-cluster cloud costs. This also prevents resource congestion that could cause service disruptions for other applications.
- The workloads/pods in different member clusters selected by the same HPA CR/resource share the application's load equally. For example, if 10 pods of the application are spread across two member clusters with the distribution `cluster1: 3 pods, cluster2: 7 pods`, then the 3 pods in cluster1 take 3/10 of the total requests and the 7 pods in cluster2 take 7/10. Scenarios that don't meet this restriction are not considered in this proposal.
- If the system doesn't meet the first point's requirement, some clusters' services may be overloaded, while other clusters' services may be underloaded.
- Karmada FederatedHPA provides full capabilities to scale instances across multiple clusters by controlling the min/max and scale sub-resources. However, with different architectures, scaling may be implemented in other ways, such as using traffic routing to implement scaling based on cluster priority. In that case, Karmada FederatedHPA is only responsible for propagating the HPA. (Trip.com Group Limited)
If there is already an HPA resource in Karmada, users will need to delete the existing HPA resource and apply the new FederatedHPA API.
From the above architecture, FederatedHPA controller should be responsible for the following things:
- Based on the FederatedHPA configuration, assign the minReplicas/maxReplicas of HPA to member clusters by creating works.
- During the initial HPA assignment, the controller should update the workload's ResourceBinding based on the assignment results.
- If there is a priority policy, the FederatedHPA controller should scale down the low-priority clusters first; if there is no priority policy, the FederatedHPA controller does not need to care about the scaling-down operation.
- If the cluster's Kubernetes version is below 1.23, the HPA resource version will be `autoscaling/v2beta2`; otherwise it will be `autoscaling/v2`. (`autoscaling/v2beta2` is available in versions after 1.11, `autoscaling/v2` in versions after 1.22.) A sketch of this version selection follows the list.
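To make the version handling above concrete, here is a minimal, illustrative sketch of how the controller might pick the HPA apiVersion from a member cluster's server version. The helper name is hypothetical, and using `k8s.io/apimachinery/pkg/util/version` for parsing is an assumption of this sketch rather than something the proposal mandates.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
)

// hpaAPIVersionForCluster returns the HPA apiVersion to use in the work
// created for a member cluster, based on that cluster's Kubernetes version:
// clusters below v1.23 get autoscaling/v2beta2, newer clusters get autoscaling/v2.
func hpaAPIVersionForCluster(serverVersion string) (string, error) {
	v, err := version.ParseGeneric(serverVersion)
	if err != nil {
		return "", err
	}
	if v.LessThan(version.MustParseGeneric("1.23.0")) {
		return "autoscaling/v2beta2", nil
	}
	return "autoscaling/v2", nil
}

func main() {
	for _, sv := range []string{"v1.22.5", "v1.26.1"} {
		apiVersion, _ := hpaAPIVersionForCluster(sv)
		fmt.Printf("%s -> %s\n", sv, apiVersion)
	}
}
```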
FederatedHPA has its own characteristics, such as a delay when executing scaling across clusters. To make FederatedHPA more flexible and easy to extend, we introduce the FederatedHPA API, which enables scaling to be implemented with a wide range of policies.
type FederatedHPA struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Spec defines the desired behavior of the federated HPA.
	Spec FederatedHPASpec `json:"spec,omitempty"`

	// Status is the current information about the federated HPA.
	// +optional
	Status FederatedHPAStatus `json:"status,omitempty"`
}
type FederatedHPASpec struct {
	autoscalingv2.HorizontalPodAutoscalerSpec `json:",inline"`

	// ClusterAffinity represents scheduling restrictions to a certain set of clusters.
	// If not set, any cluster can be a scheduling candidate.
	// +optional
	ClusterAffinity *policyv1alpha1.ClusterAffinity `json:"clusterAffinity,omitempty"`

	// AutoscaleMultiClusterDelay is the delay before executing the autoscaling action in other
	// clusters once the current cluster cannot scale the workload. It is used to give the
	// cluster-autoscaler in the current cluster a chance to provision resources first.
	// +optional
	AutoscaleMultiClusterDelay *int32 `json:"autoscaleMultiClusterDelay,omitempty"`

	// ScalingToZero indicates whether the workload can be scaled to zero in member clusters.
	// +optional
	ScalingToZero bool `json:"scalingToZero,omitempty"`

	// FederatedHPAAssignment is the assignment policy of FederatedMinReplicas/FederatedMaxReplicas.
	// +required
	FederatedHPAAssignment FederatedHPAAssignment `json:"federatedHPAAssignment,omitempty"`
}
// FederatedHPAAssignmentType describes the assignment methods for FederatedHPA FederatedMinReplicas/FederatedMaxReplicas.
type FederatedHPAAssignmentType string

const (
	// FederatedHPAAssignmentTypeDuplicated means when assigning a FederatedHPA,
	// each candidate member cluster will directly apply the original FederatedMinReplicas/FederatedMaxReplicas.
	FederatedHPAAssignmentTypeDuplicated FederatedHPAAssignmentType = "Duplicated"
	// FederatedHPAAssignmentTypeAggregated means when assigning a FederatedHPA,
	// FederatedMinReplicas/FederatedMaxReplicas is divided into as few clusters' HPAs as possible,
	// while respecting clusters' resource availability during the division.
	FederatedHPAAssignmentTypeAggregated FederatedHPAAssignmentType = "Aggregated"
	// FederatedHPAAssignmentTypeStaticWeighted means when assigning a FederatedHPA,
	// FederatedMinReplicas/FederatedMaxReplicas is divided into clusters' HPAs by static weight.
	FederatedHPAAssignmentTypeStaticWeighted FederatedHPAAssignmentType = "StaticWeighted"
	// FederatedHPAAssignmentTypeDynamicWeighted means when assigning a FederatedHPA,
	// FederatedMinReplicas/FederatedMaxReplicas is divided into clusters' HPAs by dynamic weight,
	// where clusters' resource availability represents the dynamic weight.
	FederatedHPAAssignmentTypeDynamicWeighted FederatedHPAAssignmentType = "DynamicWeighted"
	// FederatedHPAAssignmentTypePrioritized means when assigning a FederatedHPA,
	// FederatedMinReplicas/FederatedMaxReplicas is divided into clusters' HPAs by priority,
	// where the priority is configured in ClusterAutoscalingPreference.
	FederatedHPAAssignmentTypePrioritized FederatedHPAAssignmentType = "Prioritized"
)
type FederatedHPAAssignment struct {
	// FederatedHPAAssignmentType determines how FederatedMinReplicas/FederatedMaxReplicas is assigned to member clusters.
	// Valid options are Duplicated, Aggregated, StaticWeighted, DynamicWeighted and Prioritized.
	// +kubebuilder:validation:Enum=Duplicated;Aggregated;StaticWeighted;DynamicWeighted;Prioritized
	// +kubebuilder:default=Duplicated
	FederatedHPAAssignmentType FederatedHPAAssignmentType `json:"federatedHPAAssignmentType,omitempty"`

	// ClusterAutoscalingPreference describes the priority/static weight for each cluster.
	// It only takes effect when FederatedHPAAssignmentType is set to "StaticWeighted" or "Prioritized".
	// +optional
	ClusterAutoscalingPreference []TargetClusterAutoscalingPreference `json:"clusterAutoscalingPreference,omitempty"`
}
type TargetClusterAutoscalingPreference struct {
	// TargetCluster describes the filter to select clusters.
	// +required
	TargetCluster *policyv1alpha1.ClusterAffinity `json:"targetCluster,omitempty"`

	// AutoscalingMaxThreshold is the upper limit for the number of replicas for member clusters,
	// it only works when the FederatedHPAAssignmentType is Aggregated/DynamicWeighted/Prioritized.
	// +optional
	// AutoscalingMaxThreshold int32 `json:"autoscalingMaxThreshold,omitempty"`

	// AutoscalingMinThreshold is the lower limit for the number of replicas for member clusters,
	// it only works when the FederatedHPAAssignmentType is Aggregated/DynamicWeighted/Prioritized.
	// +optional
	// AutoscalingMinThreshold int32 `json:"autoscalingMinThreshold,omitempty"`

	// StaticWeight is the weight for the target cluster, it only works when the FederatedHPAAssignmentType is StaticWeighted.
	// +optional
	StaticWeight *int32 `json:"staticWeight,omitempty"`

	// Priority is the priority for the target cluster, it only works when the FederatedHPAAssignmentType is Prioritized.
	// +optional
	Priority *int32 `json:"priority,omitempty"`
}
type FederatedHPAStatus struct {
	// AggregatedStatus contains the assignment results and the status of the HPAs.
	AggregatedStatus []autoscalingv2.HorizontalPodAutoscalerStatus `json:"aggregatedStatus,omitempty"`
}
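For illustration, here is a minimal sketch of constructing a FederatedHPA with the types above. The package name, the helper name, and all example values (target workload, cluster names, weights) are assumptions of this sketch; only fields introduced in this proposal plus the embedded Kubernetes HPA spec are shown.

```go
package v1alpha1 // assumed to be the package holding the types above

import (
	autoscalingv2 "k8s.io/api/autoscaling/v2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/pointer"

	policyv1alpha1 "github.com/karmada-io/karmada/pkg/apis/policy/v1alpha1"
)

// sampleFederatedHPA builds a FederatedHPA that splits minReplicas/maxReplicas
// across member1 and member2 with static weights 1:2.
func sampleFederatedHPA() *FederatedHPA {
	return &FederatedHPA{
		ObjectMeta: metav1.ObjectMeta{Name: "nginx-fhpa", Namespace: "default"},
		Spec: FederatedHPASpec{
			HorizontalPodAutoscalerSpec: autoscalingv2.HorizontalPodAutoscalerSpec{
				ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
					APIVersion: "apps/v1", Kind: "Deployment", Name: "nginx",
				},
				MinReplicas: pointer.Int32(2),
				MaxReplicas: 10,
			},
			ClusterAffinity: &policyv1alpha1.ClusterAffinity{
				ClusterNames: []string{"member1", "member2"},
			},
			FederatedHPAAssignment: FederatedHPAAssignment{
				FederatedHPAAssignmentType: FederatedHPAAssignmentTypeStaticWeighted,
				ClusterAutoscalingPreference: []TargetClusterAutoscalingPreference{
					{
						TargetCluster: &policyv1alpha1.ClusterAffinity{ClusterNames: []string{"member1"}},
						StaticWeight:  pointer.Int32(1),
					},
					{
						TargetCluster: &policyv1alpha1.ClusterAffinity{ClusterNames: []string{"member2"}},
						StaticWeight:  pointer.Int32(2),
					},
				},
			},
		},
	}
}
```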
- Once the FederatedHPA is applied to a workload, the Karmada scheduler should no longer schedule this workload.
- Once the FederatedHPA is applied to a workload, the replicas changes made in member clusters should be retained.
When FederatedHPA is applied for the first time or updated, the controller will initialize the HPA assignments to member clusters. The assignments should follow the policy configuration. The following contents describe how to assign the HPA to member clusters with different policies.
With this policy, FederatedHPA will assign the same minReplicas/maxReplicas (equal to the FederatedHPA's minReplicas/maxReplicas) to all member clusters.
So the FederatedHPA controller will create the HPA's works, in which the minReplicas/maxReplicas are equal to the FederatedHPA's minReplicas/maxReplicas, in all member clusters. The FederatedHPA controller will also update the ResourceBinding to remove the workload from clusters or assign the workload to clusters.
Suppose we have the following configuration:
# FederatedHPA Configuration
minReplicas: 3
maxReplicas: 10
ClusterAffinity:
clusterNames:
- member1
- member2
- member3
- member5
---
# ResourceBinding
clusters:
- name: member1
replicas: 1
- name: member2
replicas: 4
- name: member3
replicas: 20
- name: member4
replicas: 5
After the assignment, the result will be:
# member 1/2/3/5 HPA
minReplicas: 3
maxReplicas: 10
# ResourceBinding
clusters:
- name: member1
replicas: 1
- name: member2
replicas: 4
- name: member3
replicas: 20
- name: member5
replicas: 3
We can see that member4 is deleted from the ResourceBinding, which means that once the FederatedHPA is applied, the selection result of the Karmada scheduler will be ignored.
With this policy, FederatedHPA will assign the minReplicas/maxReplicas to all member clusters based on the static weight configuration.
Suppose we have the following configuration:
# FederatedHPA Configuration
minReplicas: 2
maxReplicas: 10
ClusterAffinity:
clusterNames:
- member1
- member2
- member3
ScaleToZero: {scaletozero_config}
ClusterAutoscalingPreference:
- targetCluster:
clusterNames:
- member1
staticWeight: 1
- targetCluster:
clusterNames:
- member2
staticWeight: 2
- targetCluster:
clusterNames:
- member3
staticWeight: 3
---
# ResourceBinding
clusters:
- name: member1
replicas: 1
- name: member2
replicas: 4
After the assignment, the result of the HPA assignment will be:
#member3 HPA
minReplicas: 1
maxReplicas: 5
#member2 HPA
minReplicas: 1
maxReplicas: 4
#member1 HPA
minReplicas: 1
maxReplicas: 1
So if, after the calculation, a cluster's minReplicas is less than 1 while its maxReplicas is greater than or equal to 1, its minReplicas should be set to 1. A sketch of this weighted division follows.
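Below is a minimal, self-contained sketch of the static-weight division with the minReplicas clamp just described. The largest-remainder rounding used here is an assumption of the sketch: the example above distributes the rounding remainder of maxReplicas differently, and the exact rounding is an implementation detail of the controller.

```go
package main

import (
	"fmt"
	"sort"
)

type weightedCluster struct {
	name   string
	weight int32
}

// divideByStaticWeight splits total replicas across clusters in proportion to
// their weights, handing the rounding remainder to the clusters with the
// largest fractional parts.
func divideByStaticWeight(total int32, clusters []weightedCluster) map[string]int32 {
	result := make(map[string]int32, len(clusters))
	var weightSum int64
	for _, c := range clusters {
		weightSum += int64(c.weight)
	}
	if weightSum == 0 {
		return result
	}
	type remainder struct {
		name string
		frac int64 // scaled fractional part: (total*weight) % weightSum
	}
	var rems []remainder
	var assigned int32
	for _, c := range clusters {
		share := int64(total) * int64(c.weight)
		quota := int32(share / weightSum)
		result[c.name] = quota
		assigned += quota
		rems = append(rems, remainder{name: c.name, frac: share % weightSum})
	}
	sort.Slice(rems, func(i, j int) bool { return rems[i].frac > rems[j].frac })
	for i := 0; i < int(total-assigned); i++ {
		result[rems[i].name]++
	}
	return result
}

func main() {
	clusters := []weightedCluster{{"member1", 1}, {"member2", 2}, {"member3", 3}}
	minPer := divideByStaticWeight(2, clusters)  // FederatedHPA minReplicas
	maxPer := divideByStaticWeight(10, clusters) // FederatedHPA maxReplicas
	for _, c := range clusters {
		// A cluster whose computed minReplicas is below 1 while its
		// maxReplicas is at least 1 gets its minReplicas raised to 1.
		if minPer[c.name] < 1 && maxPer[c.name] >= 1 {
			minPer[c.name] = 1
		}
		fmt.Printf("%s: minReplicas=%d maxReplicas=%d\n", c.name, minPer[c.name], maxPer[c.name])
	}
}
```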
With different `ScaleToZero` configurations, the `ResourceBinding` update will differ; the configuration should depend on your multi-cluster architecture.
If `ScaleToZero` is true, after the assignment the ResourceBinding will be:
# ResourceBinding
clusters:
- name: member1
replicas: 1
- name: member2
replicas: 4
- name: member3
replicas: 0
If `ScaleToZero` is false, after the assignment the ResourceBinding will be:
# ResourceBinding
clusters:
- name: member1
replicas: 1
- name: member2
replicas: 4
- name: member3
replicas: 1
With this policy, FederatedHPA will assign the minReplicas/maxReplicas to all member clusters based on dynamic weights. Currently, the only supported dynamic factor is availableReplicas.
The initial assignment behavior is similar to the StaticWeighted assignment policy. The only difference is that the dynamic weight is calculated based on the availableReplicas of member clusters.
Suppose we have the following configuration:
# FederatedHPA Configuration
minReplicas: 8
maxReplicas: 24
ClusterAffinity:
clusterNames:
- member1
- member2
- member3
ScaleToZero: {scaletozero_config}
ClusterAutoscalingPreference:
hpaAssignmentPolicy: DynamicWeighted
# Cluster availableReplicas
member1: 1
member2: 5
member3: 2
After the assignment, the result of the HPA assignment will be:
#member 1
minReplicas: 1
maxReplicas: 3
#member 2
minReplicas: 5
maxReplicas: 15
#member 3
minReplicas: 2
maxReplicas: 6
The ResourceBinding will be updated after the assignment, the same as with the StaticWeighted assignment policy. A minimal sketch of the dynamic-weight calculation follows.
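The following is a minimal sketch of the dynamic-weight calculation for the example above: the weights are simply the clusters' availableReplicas, and the division happens as in the StaticWeighted sketch (the split is exact here, so no remainder handling is needed).

```go
package main

import "fmt"

func main() {
	// Dynamic weights are the clusters' availableReplicas
	// (member1: 1, member2: 5, member3: 2 in the example above).
	available := map[string]int32{"member1": 1, "member2": 5, "member3": 2}
	var weightSum int32
	for _, a := range available {
		weightSum += a
	}
	fhpaMin, fhpaMax := int32(8), int32(24)
	for _, name := range []string{"member1", "member2", "member3"} {
		w := available[name]
		fmt.Printf("%s: minReplicas=%d maxReplicas=%d\n",
			name, fhpaMin*w/weightSum, fhpaMax*w/weightSum)
	}
}
```

Running this reproduces the assignment above: member1 gets 1/3, member2 gets 5/15, and member3 gets 2/6.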
With this policy, FederatedHPA will assign the minReplicas/maxReplicas to as few member clusters as possible, while respecting the clusters' resource availability.
Suppose we have the following configuration:
minReplicas: 8
maxReplicas: 24
ClusterAffinity:
clusterNames:
- member1
- member2
- member3
ScaleToZero: {scaletozero_config}
ClusterAutoscalingPreference:
hpaAssignmentPolicy: Aggregated
# Cluster availableReplicas
member1: 8
member2: 2
member3: 2
So after the assignment, the result of the HPA assignment will be:
#member 1
minReplicas: 8
maxReplicas: 18 #8+10, after first assignment cycle, 10 replicas is left.
#member 2
minReplicas: 1
maxReplicas: 2
#member 3
minReplicas: 1
maxReplicas: 2
The ResourceBinding will be updated after the assignment, the same as with the StaticWeighted assignment policy.
With this policy, FederatedHPA will assign the minReplicas/maxReplicas to member clusters based on the priority configuration.
Suppose we have the following configuration:
#FederatedHPA Configuration
minReplicas: 8
maxReplicas: 24
ClusterAffinity:
clusterNames:
- member1
- member2
- member3
ScaleToZero: {scaletozero_config}
ClusterAutoscalingPreference:
hpaAssignmentPolicy: Prioritized
hpaAssignmentPolicyConfiguration:
targetCluster:
- clusterNames:
- member1
priority: 2
- clusterNames:
- member2
priority: 1
#Cluster availableReplicas
member1: 20
member2: 1
#ResourceBinding
clusters:
- name: member1
replicas: 15
If `ScaleToZero` is true, the result of the HPA assignment will be:
#member 1
minReplicas: 8
maxReplicas: 23
#member 2
minReplicas: 1
maxReplicas: 1
#ResourceBinding
clusters:
- name: member1
replicas: 15
- name: member2
replicas: 0
If `ScaleToZero` is false, the result of the HPA assignment will be:
#member 1
minReplicas: 8
maxReplicas: 23
#member 2
minReplicas: 1
maxReplicas: 1
#ResourceBinding
clusters:
- name: member1
replicas: 15
- name: member2
replicas: 1
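Below is a minimal sketch of one way to reproduce the prioritized assignment above: hand out replicas in descending priority order, cap each cluster at what it can actually hold (current plus available replicas), and reserve one replica of maxReplicas for each lower-priority cluster so that every candidate keeps an HPA with maxReplicas >= 1. The reservation rule and the different treatment of min/max are assumptions inferred from the example, not a normative algorithm.

```go
package main

import (
	"fmt"
	"sort"
)

type prioritizedCluster struct {
	name              string
	priority          int32
	currentReplicas   int32 // replicas already running in the cluster (from the ResourceBinding)
	availableReplicas int32 // replicas the cluster can still accommodate
}

// assignByPriority distributes total replicas in descending priority order.
// When reserveOne is true, one replica is kept back for each lower-priority
// cluster so that it still ends up with maxReplicas >= 1.
func assignByPriority(total int32, clusters []prioritizedCluster, reserveOne bool) map[string]int32 {
	sorted := append([]prioritizedCluster(nil), clusters...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].priority > sorted[j].priority })
	result := make(map[string]int32, len(sorted))
	remaining := total
	for i, c := range sorted {
		var reserved int32
		if reserveOne {
			reserved = int32(len(sorted) - i - 1) // one replica per lower-priority cluster
		}
		assigned := remaining - reserved
		if capacity := c.currentReplicas + c.availableReplicas; assigned > capacity {
			assigned = capacity
		}
		if assigned < 0 {
			assigned = 0
		}
		result[c.name] = assigned
		remaining -= assigned
	}
	return result
}

func main() {
	clusters := []prioritizedCluster{
		{name: "member1", priority: 2, currentReplicas: 15, availableReplicas: 20},
		{name: "member2", priority: 1, currentReplicas: 0, availableReplicas: 1},
	}
	maxPer := assignByPriority(24, clusters, true) // member1: 23, member2: 1
	minPer := assignByPriority(8, clusters, false) // member1: 8, member2: 0 (clamped below)
	for _, c := range clusters {
		// minReplicas clamp rule from the StaticWeighted section.
		if minPer[c.name] < 1 && maxPer[c.name] >= 1 {
			minPer[c.name] = 1
		}
		fmt.Printf("%s: minReplicas=%d maxReplicas=%d\n", c.name, minPer[c.name], maxPer[c.name])
	}
}
```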
With this design architecture, the behavior of scaling across clusters has two parts: scaling up and scaling down. In addition, to keep the FederatedHPA controller's scaling free of conflicts with member clusters' HPA controllers, the FederatedHPA controller only scales workloads up in member clusters.
For Duplicated/StaticWeighted/DynamicWeighted/Aggregated policies:
- The FederatedHPA controller should not do anything when scaling down; the HPA controllers in the member clusters handle it.
- The FederatedHPA controller should not do anything when scaling up either. The reason is that, because the load is shared equally, when one cluster scales up, all other clusters scale up simultaneously, so the additional requests are distributed to the clusters that can scale successfully. As a result, any pods left pending in a cluster due to insufficient resources are compensated for over time by the additional replicas in the other scaled clusters.
Once the cluster with the highest priority runs out of resources (pods are pending), the Karmada FederatedHPA controller should update the HPA in the cluster with the next highest priority.
Suppose we have the following configuration:
#member 1
minReplicas: 8
maxReplicas: 23
currentReadyPods: 10
currentPendingPods: 6
priority: 2
#member 2
minReplicas: 1
maxReplicas: 1
currentReadyPods: 1(ScaleToZero=false)/0(ScaleToZero=true)
priority: 1
So, FederatedHPA controller should update the HPA to the following configuration:
#member 1
minReplicas: 8
maxReplicas: 10
currentReadyPods: 10
priority: 2
#member 2
minReplicas: 1
maxReplicas: 14
currentReadyPods: 1 # No matter what the configuration is, it should be updated to 1.
priority: 1
If this is triggered by pending pods, the operation should be executed after `AutoscaleMultiClusterDelay`. A sketch of this headroom shift follows.
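Here is a minimal, self-contained sketch of the headroom shift above, using the example's numbers; the struct and field names are illustrative only, and in the real controller the shift would only run after `AutoscaleMultiClusterDelay` has elapsed.

```go
package main

import "fmt"

type memberHPA struct {
	name           string
	priority       int32
	minReplicas    int32
	maxReplicas    int32
	currentReady   int32
	currentPending int32
}

// shiftHeadroomOnPending caps a pending cluster's maxReplicas at its ready
// replicas and moves the freed headroom to the next cluster in priority order.
// Clusters are assumed to be sorted by priority, highest first.
func shiftHeadroomOnPending(clusters []memberHPA) []memberHPA {
	for i := range clusters {
		c := &clusters[i]
		if c.currentPending > 0 && i+1 < len(clusters) {
			freed := c.maxReplicas - c.currentReady
			c.maxReplicas = c.currentReady
			clusters[i+1].maxReplicas += freed
		}
	}
	return clusters
}

func main() {
	clusters := []memberHPA{
		{name: "member1", priority: 2, minReplicas: 8, maxReplicas: 23, currentReady: 10, currentPending: 6},
		{name: "member2", priority: 1, minReplicas: 1, maxReplicas: 1, currentReady: 0},
	}
	// member1's maxReplicas drops to 10 and member2's grows to 14, as above.
	for _, c := range shiftHeadroomOnPending(clusters) {
		fmt.Printf("%s: minReplicas=%d maxReplicas=%d\n", c.name, c.minReplicas, c.maxReplicas)
	}
}
```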
In general, if one cluster is scaling down, all the member clusters are scaling down the same workload. The FederatedHPA controller should keep the high-priority clusters' scale subresources at their current replicas, and let the low-priority clusters' HPA controllers scale their replicas down first.
For example, there is the following cluster's status:
targetUtilization: 50%
cluster 1:
current replicas: 4
maxReplicas: 4
current Utilization: 25
priority: 3
minReplicas: 1
cluster 2:
current replicas: 9
maxReplicas: 9
current Utilization: 25
currentAvailableReplicas: 3
priority: 2
minReplicas: 1
cluster 3:
current replicas: 4
maxReplicas: 4
current Utilization: 25
currentAvailableReplicas: 1
priority: 1
minReplicas: 1
So the steps will be:
- The FederatedHPA controller should keep cluster-1/cluster-2's scale resources at 4/9 replicas, and cluster-3's HPA controller will scale its replicas down to 2, as sketched below.
- After cluster-3's replicas are equal to 1, the FederatedHPA controller will only keep cluster-1's scale resource, and let cluster-2's HPA controller work normally.
- After cluster-3's replicas are equal to 1, set maxReplicas to 1 in cluster 3 and set maxReplicas to 7 in cluster 1.
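A minimal sketch of the first and last steps above with the example's numbers (the middle step is just waiting for cluster 3 to reach its minReplicas); the data structures and the way pinning is represented are illustrative assumptions.

```go
package main

import "fmt"

type clusterState struct {
	name            string
	priority        int32
	currentReplicas int32
	minReplicas     int32
	maxReplicas     int32
}

func main() {
	// Clusters are listed from highest to lowest priority, as in the example.
	clusters := []clusterState{
		{name: "cluster1", priority: 3, currentReplicas: 4, minReplicas: 1, maxReplicas: 4},
		{name: "cluster2", priority: 2, currentReplicas: 9, minReplicas: 1, maxReplicas: 9},
		{name: "cluster3", priority: 1, currentReplicas: 4, minReplicas: 1, maxReplicas: 4},
	}
	lowest := len(clusters) - 1 // the lowest-priority cluster scales down first

	// Step 1: pin every other cluster's scale subresource at its current replicas.
	for i, c := range clusters {
		if i != lowest {
			fmt.Printf("pin %s scale at %d replicas\n", c.name, c.currentReplicas)
		}
	}

	// Final step: once the lowest-priority cluster reaches its minReplicas,
	// cap its maxReplicas there and hand the freed headroom to the
	// highest-priority cluster (cluster1: 4 + 3 = 7 in the example).
	freed := clusters[lowest].maxReplicas - clusters[lowest].minReplicas
	clusters[lowest].maxReplicas = clusters[lowest].minReplicas
	clusters[0].maxReplicas += freed
	fmt.Printf("%s maxReplicas=%d, %s maxReplicas=%d\n",
		clusters[lowest].name, clusters[lowest].maxReplicas,
		clusters[0].name, clusters[0].maxReplicas)
}
```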
The resource state of member clusters changes over time, so we should optimize the HPA minReplicas/maxReplicas, in order to:
- Maximize the scaling ability of member clusters: if the HPA minReplicas/maxReplicas no longer suits a member cluster's state, a scaling operation may fail because of the maxReplicas limit.
- Better tolerate a Karmada control plane disaster: if the Karmada control plane is down, the member clusters' HPA controllers can still scale the workloads well and resource utilization stays higher.
PS: This only works for the Aggregated/DynamicWeighted/Prioritized policies.
So the key optimization steps are:
- Update all the member clusters' maxReplicas to the current workload replicas.
- Sum (maxReplicas - current replicas) over all clusters.
- Reassign the sum to all member clusters' maxReplicas based on the policy.
For example, the StaticWeighted policy is used:
#cluster 1
staticWeight: 2
current replica: 6
maxReplicas: 7
#cluster 2
staticWeight: 1
current replica: 6
maxReplicas: 8
#cluster 3
staticWeight: 1
current replica: 6
maxReplicas: 7
The sum is (7-6)+(8-6)+(7-6)=4, so the new HPA of clusters should be:
#cluster 1
staticWeight: 2
current replica: 6
maxReplicas: 8
#cluster 2
staticWeight: 1
current replica: 6
maxReplicas: 7
#cluster 3
staticWeight: 1
current replica: 6
maxReplicas: 7
The purpose of using `Sum(maxReplicas - currentReplicas)` is to avoid overloading. A minimal sketch of this rebalancing follows.
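Below is a minimal sketch of this rebalancing for the StaticWeighted example above: reset maxReplicas to the current replicas, sum the freed headroom, and redistribute it by static weight. The split happens to be exact here; remainder handling would follow the StaticWeighted assignment sketch earlier in this document.

```go
package main

import "fmt"

type optimizedCluster struct {
	name            string
	staticWeight    int32
	currentReplicas int32
	maxReplicas     int32
}

// rebalanceMaxReplicas sums each cluster's headroom (maxReplicas - currentReplicas),
// resets every maxReplicas to the current replicas, and redistributes the summed
// headroom in proportion to the static weights.
func rebalanceMaxReplicas(clusters []optimizedCluster) []optimizedCluster {
	var headroom, weightSum int32
	for _, c := range clusters {
		headroom += c.maxReplicas - c.currentReplicas
		weightSum += c.staticWeight
	}
	for i := range clusters {
		clusters[i].maxReplicas = clusters[i].currentReplicas + headroom*clusters[i].staticWeight/weightSum
	}
	return clusters
}

func main() {
	clusters := []optimizedCluster{
		{name: "cluster1", staticWeight: 2, currentReplicas: 6, maxReplicas: 7},
		{name: "cluster2", staticWeight: 1, currentReplicas: 6, maxReplicas: 8},
		{name: "cluster3", staticWeight: 1, currentReplicas: 6, maxReplicas: 7},
	}
	for _, c := range rebalanceMaxReplicas(clusters) {
		fmt.Printf("%s: maxReplicas=%d\n", c.name, c.maxReplicas) // 8, 7, 7
	}
}
```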
As for minReplicas, the optimization condition is: when a cluster's workload replicas are zero, its HPA will be reassigned to all existing clusters (including itself).
When some clusters cannot scale the workloads up (pods are pending), the FederatedHPA controller should reassign (maxReplicas - non-pending replicas) to the other clusters' HPA maxReplicas, and set the pending clusters' HPA maxReplicas to their current replicas.
This behavior only exists when `ScaleToZero` is true.
After the initial assignment and scaling operations, the replicas in the member clusters may all be equal to their HPA minReplicas, while the sum of the minReplicas is larger than the FederatedHPA's minReplicas and the utilization is smaller than the target. In this scenario, the FederatedHPA controller should scale some clusters' replicas to 0.
So the key point is how to select the target clusters (a selection sketch follows the list):
- Duplicated policy: FederatedHPA controller shouldn't do anything.
- Aggregated policy: Select the one with the smallest replicas.
- StaticWeighted policy: Select the one with the smallest static weight.
- DynamicWeighted policy: Select the one with the smallest dynamic weight.
- Prioritized policy: Select the one with the smallest priority.
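A minimal sketch of the per-policy selection listed above; the struct, field names, and the plain string policy parameter are illustrative assumptions (dynamicWeight stands for the cluster's availableReplicas).

```go
package main

import "fmt"

type zeroCandidate struct {
	name          string
	replicas      int32
	staticWeight  int32
	dynamicWeight int32 // e.g. availableReplicas
	priority      int32
}

// selectClusterToScaleToZero picks which cluster's workload the FederatedHPA
// controller scales to zero, following the per-policy rules above.
// It returns "" for the Duplicated policy, where the controller does nothing.
func selectClusterToScaleToZero(policy string, clusters []zeroCandidate) string {
	if policy == "Duplicated" || len(clusters) == 0 {
		return ""
	}
	best := clusters[0]
	for _, c := range clusters[1:] {
		switch policy {
		case "Aggregated":
			if c.replicas < best.replicas {
				best = c
			}
		case "StaticWeighted":
			if c.staticWeight < best.staticWeight {
				best = c
			}
		case "DynamicWeighted":
			if c.dynamicWeight < best.dynamicWeight {
				best = c
			}
		case "Prioritized":
			if c.priority < best.priority {
				best = c
			}
		}
	}
	return best.name
}

func main() {
	clusters := []zeroCandidate{
		{name: "member1", replicas: 1, staticWeight: 1, dynamicWeight: 2, priority: 2},
		{name: "member2", replicas: 2, staticWeight: 3, dynamicWeight: 1, priority: 1},
	}
	fmt.Println(selectClusterToScaleToZero("Prioritized", clusters)) // member2
}
```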
After the `Scaling to zero` step, when a request burst occurs, the FederatedHPA controller should scale the clusters up from zero replicas.
So the key point is how to select the target clusters:
- Duplicated policy: FederatedHPA controller shouldn't do anything.
- Aggregated policy: Select all the clusters with zero replicas.
- StaticWeighted policy: Select all the clusters with zero replicas.
- DynamicWeighted policy: Select all the clusters with zero replicas.
- Prioritized policy: Select the cluster with the highest priority among those with zero replicas.
If one member cluster fails (or is tainted), the Karmada FederatedHPA controller should reassign the HPA of the failed cluster to other clusters based on the policy. The logic should be the same as workload failover.
With scaling operations performed at different layers (member clusters and the Karmada control plane), scaling across clusters is highly available. When member clusters can scale the workloads by themselves, the Karmada control plane will not help to scale the workloads across clusters. When member clusters cannot scale the workloads (pending pods or other reasons), the Karmada control plane will help to scale the workloads in other member clusters. So when the Karmada control plane is down, the native Kubernetes HPA controllers still work in the member clusters; only the ability to scale the workloads in other clusters is lost.
This feature is quite large, so we will implement it in stages:
- Implement the API definition and the initial HPA assignment to member clusters with the Duplicated and StaticWeighted policies.
- Implement the initial HPA assignment for the DynamicWeighted, Aggregated, and Prioritized policies.
- Implement autoscaling across clusters with the Prioritized policy (scaling up/down).
- Implement the optimization of HPA in member clusters after assignment for the different policies.
- Implement scaling to/from zero with the different policies in member clusters.
- AutoscalingMaxThreshold/AutoscalingMinThreshold are quite complex and will be covered in a separate proposal in the future.
- All current tests should pass; no breaking change will be introduced by this feature.
- Add new E2E test cases to cover the new feature.
- Initial HPA assignment to member clusters with different policies.
- Scaling across clusters with the Prioritized policy (scaling up/down).
- Optimize the HPA in member clusters after assigning different policies.
- Scaling to/from zero with different policies in member clusters.
Enhancing the ability of PropagationPolicy (PP) is an alternative. But this approach would make things complex:
- It would take a lot of effort to enhance PP's ability.
- Both PP and HPA would become difficult to evolve because each needs to consider its impact on the other.
- HPA has its own characteristics, and it is not suitable to use PP to cover all of them.