2 worker pools failing to create cluster in AWS #839
Use the following config for the worker pools:

Bootkube starts, but then it fails with the following error.

Comments
CC @johananl |
For more context: pr-aws |
If we want to route HTTP traffic to multiple worker pools, I think we have the following options:
I think this is more of a design question than a bug. We should think about the functionality we want to have and then build things accordingly. |
BTW, the failure isn't related to Bootkube. |
Yes, so if we want users to work with the present config, we should at least document it. But what if a user creates 3 pools? Then they have to set … |
We can also use the CCM to set up the load balancing for us.
The bug is that we do not validate whether the 2 pools have colliding ports, and we do not document this (implied) requirement. |
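To illustrate the implied requirement, here is a minimal sketch of two pools that do not collide, assuming hypothetical per-pool port knobs named lb_http_port and lb_https_port (the real attribute names may differ):

worker_pool "one" {
  count = 1
  # ... other pool settings omitted

  # Hypothetical knobs: the listener ports this pool claims on the shared NLB.
  lb_http_port  = 80
  lb_https_port = 443
}

worker_pool "two" {
  count = 1
  # ... other pool settings omitted

  # Must not overlap with the ports claimed by pool "one".
  lb_http_port  = 8080
  lb_https_port = 8443
}

Without a validation rule, two pools that both keep the default ports would ask for two listeners on the same NLB port, which AWS rejects.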
From a quick glance at the Terraform code, I have a feeling that with the current implementation it would make sense to move the … Then, if … This way, we fix the collision between the 2 worker pools and don't modify existing behavior.

Alternatively, we could just reject such a configuration for now and document the limitation. We should also mention that if only one worker pool handles the ingress traffic, a node selector and labels should be used on this pool and on the contour component, so the Contour (or Envoy?) pods only run on this worker pool.

Finally, I think ideally we should just be using the CCM to set up the load balancers for us and remove all of this code from Terraform, as IMO it doesn't really "fit" into the Kubernetes setup. |
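A rough sketch of the node pinning idea, assuming the worker pool supports a labels map and the contour component exposes a node selector knob (both attribute names here are hypothetical):

worker_pool "ingress" {
  count = 2
  # ... other pool settings omitted

  # Assumed knob: node labels applied to every node in this pool.
  labels = {
    "pool" = "ingress"
  }
}

component "contour" {
  # Hypothetical knob: schedule the Contour/Envoy pods only on the labeled pool,
  # so ingress traffic is handled exclusively by the nodes behind the LB.
  node_selector = {
    "pool" = "ingress"
  }
}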
I think it makes more sense to expose the load balancer's ID as an output in the controllers module and have each worker pool create its own target group and attach itself to the LB (see the Terraform sketch below). My rationale:
|
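A minimal Terraform sketch of the proposal above, assuming a hypothetical split between a controllers module (which owns the NLB) and a workers module instantiated once per pool; all resource and variable names are illustrative, not the actual Lokomotive code:

# controllers module: expose the NLB so worker pools can attach to it.
# Assumes the module already defines aws_lb.nlb.
output "nlb_arn" {
  value = aws_lb.nlb.arn
}

# workers module (one instance per pool).
variable "nlb_arn" {}
variable "vpc_id" {}
variable "listener_port" {}
variable "target_port" {}

resource "aws_lb_target_group" "pool" {
  vpc_id      = var.vpc_id
  protocol    = "TCP"
  port        = var.target_port
  target_type = "instance"
}

resource "aws_lb_listener" "pool" {
  load_balancer_arn = var.nlb_arn
  protocol          = "TCP"
  port              = var.listener_port

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.pool.arn
  }
}

# Register the pool's nodes with its target group. Assumes the module already
# defines aws_autoscaling_group.workers. (The argument is called
# lb_target_group_arn in newer AWS provider versions.)
resource "aws_autoscaling_attachment" "pool" {
  autoscaling_group_name = aws_autoscaling_group.workers.id
  alb_target_group_arn   = aws_lb_target_group.pool.arn
}

Since all pools would share one NLB, listener ports would still have to be unique across pools, which brings back the validation question discussed above.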
One advantage of the current setup is that it allows the user to utilize a single NLB for both the API server and user applications. I like relying on CCM for LBs in general; however, we would still need an LB for the API server. |
Sure, I can look into this. |
Following an offline discussion with @knrt10, here is what we're proposing: We deploy an NLB for the API server (let's call it the "API NLB" here). By default, we don't provision LB resources for worker pools. If they so choose, the user can enable LB provisioning at the worker pool level:

cluster "foo" {
...
worker_pool "one" {
# Provision LB resources for this pool.
load_balancing {
mapping {
protocol = "TCP"
listener_port = 80
target_port = 30080
}
mapping {
protocol = "TCP"
listener_port = 443
target_port = 30443
}
}
}
worker_pool "two" {
# Provision LB resources for this pool.
load_balancing {
mapping {
protocol = "TCP"
listener_port = 7777
target_port = 8888
}
mapping {
protocol = "UDP"
listener_port = 12001
target_port = 12001
}
}
}
worker_pool "three" {
# Don't provision LB resources for this pool.
...
}
}

When the load_balancing block is specified for a worker pool, we provision LB resources for that pool.
In the docs we instruct the user to create a NodePort service with the same node ports as specified in the target port knobs (see the sketch below). When the load_balancing block is omitted, we don't provision any LB resources for that pool.

In addition to the above, the user is free to deploy services of type LoadBalancer (optionally with annotations to control the type and config of the created LB), and we ensure a CCM is present to provision load balancers on behalf of the user.

Caveats which should be documented:
Any feedback is welcome. |
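For reference, here is a sketch of the NodePort service mentioned in the proposal, written with the Terraform kubernetes provider to stay consistent with the other examples (in the docs this would more likely be a plain Kubernetes manifest). The namespace, selector and port numbers are assumptions matching worker pool "one" above:

resource "kubernetes_service" "ingress" {
  metadata {
    name      = "envoy"
    namespace = "projectcontour"
  }

  spec {
    type = "NodePort"

    selector = {
      app = "envoy"
    }

    # The node ports must match the target_port values of the pool's
    # load_balancing mappings (30080 and 30443 for pool "one").
    port {
      name        = "http"
      protocol    = "TCP"
      port        = 80
      target_port = 8080
      node_port   = 30080
    }

    port {
      name        = "https"
      protocol    = "TCP"
      port        = 443
      target_port = 8443
      node_port   = 30443
    }
  }
}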
@johananl thanks for the proposal. I think it makes sense, though it is very similar to what we have right now. But it seems that the implementation might be too complex to include in v0.4.0. Maybe we just implement a validation rule to reject conflicting ports right now (including the defaults) and schedule this for later? |
I have no opinion regarding which release to put this in. I'm fine with either rejecting the "bad" config for now or even just documenting the constraints. |
I think if we want to do this, we should do it correctly. For 0.4.0, we should just document the constraints for using different … |
I'd implement a validation rule for it. It's not difficult to do so and will prevent confusing behavior (users tend not to read the documentation 😄) |
I like this idea. I don't think there is harm in having two LBs for HA reasons. If we really want to use one LB for both app traffic and control plane traffic, do we have knobs in the CCM that can do the magic for us? |
I don't think you can modify an existing LB using the CCM. AFAIK the AWS CCM either owns an LB or doesn't. AFAICT if we want to allow using a single NLB for both the API server and application traffic, the CCM can't help there. |
I'm just coming back from holidays, but I think:
However, I might be missing some context, in particular why options like this were discarded. I can see some problems with, and some workarounds for, such an approach:
Wouldn't something like this work and be simpler than the following proposals? Is there a downside to this that I'm not aware of? If such an option was discarded, the next proposal seems to have several downsides IMHO:
But, IIUC, there is a simpler way to do this if the CCM provisions load balancers for us: just use a Service of type LoadBalancer with the annotations documented in the Contour docs (sketched after this comment). IOW, do what is explained here: https://projectcontour.io/guides/deploy-aws-nlb/. I don't know why the first option was discarded, but it seems simpler and has a better UX, IMHO. The second option (relying on the CCM), if it works, seems to be the best way (and if the CCM doesn't work for this yet, it might be the best way once it does). In any case, if either of the two options I propose was discarded for good reasons (I really lack the information to know :-D) or there is no consensus in the short term, I'm totally fine with having a validation rule for now, so that the config from the original bug report doesn't fail at the Terraform layer but instead gives a clear error message in lokoctl. Other questions due to lack of context:
Thanks! |
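A sketch of the CCM-based option mentioned above, again as Terraform kubernetes provider HCL: a Service of type LoadBalancer carrying the AWS NLB annotation from the Contour guide. The names, namespace and ports are assumptions:

resource "kubernetes_service" "envoy" {
  metadata {
    name      = "envoy"
    namespace = "projectcontour"

    annotations = {
      # Ask the AWS cloud provider for an NLB instead of a Classic ELB.
      "service.beta.kubernetes.io/aws-load-balancer-type" = "nlb"
    }
  }

  spec {
    type = "LoadBalancer"

    selector = {
      app = "envoy"
    }

    port {
      name        = "http"
      protocol    = "TCP"
      port        = 80
      target_port = 8080
    }

    port {
      name        = "https"
      protocol    = "TCP"
      port        = 443
      target_port = 8443
    }
  }
}

With this approach the CCM creates and owns the NLB for application traffic, while the Terraform-managed API NLB keeps serving only the API server.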
@rata we are using an NLB for the API server on AWS:
This is Typhoon heritage. |
Given that #889 is merged and we no longer fail to create 2 worker pools, I would like to close this issue. However, I think the information about LBs on AWS here is valuable. @johananl @surajssd @knrt10 could one of you capture the essence of the discussion here and create a new … |
I'll close this then, and we can continue the discussion in #1248. |