This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Fix AWS worker pool handling #367

Merged: 10 commits into master from johananl/fix-aws, May 12, 2020

Conversation

@johananl (Member) commented Apr 27, 2020

General

Fixes #349.

This PR fixes the handling of worker pools on AWS. The following is a summary of the changes:

  • There used to be an "invisible" worker pool of size 0 that was created with every cluster. This pool has been removed.
  • NLB listeners are now created per worker pool and are correctly wired to the matching target groups (see the sketch below this list). Before this change there was one "global" set of listeners wired to the empty worker pool, effectively blackholing all ingress traffic to the cluster.
  • NLB listener ports for HTTP/HTTPS are now configurable, with defaults of 80/443 respectively. This is required in order to support an arbitrary number of worker pools with ingress using the same NLB.
  • Target group names have been cleaned up. The AWS API limits them to 32 characters, so we want to keep them short. Related issue: Handle AWS ELB target group names > 32 characters #366
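For illustration only, here is a minimal Terraform sketch of what a per-pool listener wired to its own target group could look like. The resource names, variable names, and port values are assumptions made for this example, not the PR's actual code:

resource "aws_lb_target_group" "pool_1_http" {
  # One target group per worker pool. The port is the NodePort on which the
  # pool's ingress controller is exposed (example value).
  vpc_id      = var.vpc_id
  protocol    = "TCP"
  port        = 30080
  target_type = "instance"
}

resource "aws_lb_listener" "pool_1_http" {
  # One listener per worker pool, forwarding to that pool's target group.
  load_balancer_arn = var.nlb_arn
  protocol          = "TCP"
  port              = 80   # configurable, per the listener port change above

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.pool_1_http.arn
  }
}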

Testing

To test this PR, deploy a cluster with a config similar to the following:

cluster "aws" {
  asset_dir        = "./assets"
  cluster_name     = "my-cluster"
  controller_count = 1
  dns_zone         = "example.com"
  dns_zone_id      = "xxxxxxxx"
  ssh_pubkeys      = ["ssh-rsa ..."]

  worker_pool "pool-1" {
    instance_type = "t3a.small"
    count         = 2
    ssh_pubkeys   = ["ssh-rsa ..."]
  }
}

component "contour" {
  service_type = "NodePort"
}
component "cert-manager" {
  email = "me@example.com"
}
component "httpbin" {
  ingress_host = "httpbin.example.com"
}

httpbin should be available at https://httpbin.example.com.

@johananl (Member Author)

We're hitting the size limit of the ELB target group name (32 chars).

@johananl force-pushed the johananl/fix-aws branch 3 times, most recently from ecbad1a to 4dbc191 on April 27, 2020 18:39
@johananl (Member Author)

> We're hitting the size limit of the ELB target group name (32 chars).

Fixed.

@invidian (Member) left a comment

Just some small nits, otherwise LGTM. Nice work @johananl

@surajssd (Member)

> httpbin should be available at https://httpbin.example.com.

@johananl I was manually testing this PR, where should httpbin.example.com be pointing to in Route 53? This is a manual step, right?

@johananl (Member Author)

> httpbin should be available at https://httpbin.example.com.
>
> @johananl I was manually testing this PR, where should httpbin.example.com be pointing to in Route 53? This is a manual step, right?

Lokomotive creates a wildcard record which routes *.example.com to the NLB. You shouldn't need to touch Route 53 manually. My testing instructions should work exactly as written.
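
For illustration, the kind of wildcard record described above might look roughly like this in Terraform; the resource and variable names are assumptions, not the project's actual code:

resource "aws_route53_record" "ingress_wildcard" {
  # Routes *.<dns_zone> to the cluster's NLB via an alias record.
  zone_id = var.dns_zone_id
  name    = "*.${var.dns_zone}"
  type    = "A"

  alias {
    name                   = aws_lb.nlb.dns_name
    zone_id                = aws_lb.nlb.zone_id
    evaluate_target_health = false
  }
}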

@johananl force-pushed the johananl/fix-aws branch 3 times, most recently from e3c0e75 to ccbf8ca on April 28, 2020 10:52
@johananl requested review from surajssd and invidian on April 28, 2020 10:53
@invidian (Member)

> Lokomotive creates a wildcard record which routes *.example.com to the NLB. You shouldn't need to touch Route 53 manually. My testing instructions should work exactly as written.

@johananl if this should work fully automatically, perhaps we should add an e2e test for it which just does an HTTP request that should succeed?

@johananl (Member Author)

> Lokomotive creates a wildcard record which routes *.example.com to the NLB. You shouldn't need to touch Route 53 manually. My testing instructions should work exactly as written.
>
> @johananl if this should work fully automatically, perhaps we should add an e2e test for it which just does an HTTP request that should succeed?

I can look into that, although I wanted to keep the scope of this PR small and merge it ASAP as things are currently broken.

@surajssd (Member)

> I can look into that, although I wanted to keep the scope of this PR small and merge it ASAP as things are currently broken.

This works for me. The only difference from deploying on Packet is that on Packet we need to create a manual entry in Route 53, which lets the user choose whatever URL they want.

On AWS we need to make sure that the URL is <app-name>.<cluster-name>.<dns-zone>.

@johananl (Member Author)

> The only difference from deploying on Packet is that on Packet we need to create a manual entry in Route 53, which lets the user choose whatever URL they want.

Opened #370 to track this.

> On AWS we need to make sure that the URL is <app-name>.<cluster-name>.<dns-zone>.

Opened #371 to track this.

surajssd previously approved these changes Apr 28, 2020
@surajssd (Member) left a comment

Works for me.

@rata (Member) commented May 11, 2020

@johananl This LGTM. As I said in the previous reviews, I don't think I need to review again; the changes this PR might still receive are very small.

Regarding the open comment, as I mentioned, mark it resolved when you consider it resolved (AFAICS you ran into problems testing your solution; whatever you consider reasonable at this point is fine). Regarding the other comments, they are nit-picking, so whatever you decide (even leaving it as it is) is fine by me :)

If you need it, of course, just ping me for an LGTM :)

@johananl force-pushed the johananl/fix-aws branch 4 times, most recently from b70e252 to fc2f475 on May 11, 2020 19:12
@johananl requested review from rata and invidian on May 11, 2020 19:41
@johananl (Member Author)

All feedback was handled. Please check again.

rata previously approved these changes May 11, 2020
@rata (Member) left a comment

LGTM. Thanks a lot @johananl, this improvement is really important! :)

Review comment on assets/components/contour/values.yaml (resolved)
invidian previously approved these changes May 12, 2020
johananl added 10 commits May 12, 2020 12:19
It should be easy to figure out which cluster an AWS resource belongs to.

- Prefix EC2 `Name` tags with cluster name.
- Prefix ASG name with cluster name.
- Prefix ASG launch config with cluster and pool name.
- Prefix LB target groups with cluster and pool name.

Worker pools are managed in a dedicated module. Invoking the workers module from the controller modules disallows having an arbitrary number of worker pools.
NLB listeners are tightly coupled to the worker pool they serve: a listener can point at only one worker pool. Since we can have an arbitrary number of worker pools, we should create per-pool listeners (one for HTTP and another for HTTPS). Since we can't create two listeners with the same protocol and port, we parameterize the ports (with defaults of 80/443 for HTTP/HTTPS, respectively), as sketched below.
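
A minimal sketch of what such parameterized ports could look like in Terraform; the variable names are assumptions for illustration, not necessarily the module's actual inputs:

variable "lb_http_port" {
  description = "NLB listener port for HTTP ingress traffic."
  type        = number
  default     = 80
}

variable "lb_https_port" {
  description = "NLB listener port for HTTPS ingress traffic."
  type        = number
  default     = 443
}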
Bootkube shouldn't depend on workers, and in any case the `workers` module referenced isn't in use.

Target group name length is limited to 32 characters. We should keep the name as short as possible to accommodate clusters with long names, and in addition the word "workers" doesn't add any meaning here.
ELB target group names are limited to 32 characters. Since we easily hit this limit when trying to include the cluster and pool names in the target group name, we move this information to AWS tags and let Terraform generate a target group name automatically, roughly as sketched below.
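
For illustration only (the resource, variable, and tag names are assumptions rather than the PR's actual code), letting Terraform pick the name while keeping the cluster/pool information in tags could look like this:

resource "aws_lb_target_group" "pool_1_https" {
  # No `name` argument: Terraform assigns a random unique name, which always
  # fits within the 32-character limit. Cluster and pool names live in tags.
  vpc_id      = var.vpc_id
  protocol    = "TCP"
  port        = 30443   # example NodePort on the workers
  target_type = "instance"

  tags = {
    Cluster = var.cluster_name
    Pool    = var.pool_name
  }
}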
Worker pool names already include the cluster name.

In order to expose Envoy using a NodePort service, we need to wire the NLB to the workers using a port that's within the NodePort range. We also fix the health checks accordingly and remove an SG rule which belongs to nginx.
Exposing services using hostPorts is discouraged by the official k8s docs and has similar problems to using host networking.

On Packet, we expose Envoy as a service of type LoadBalancer, so using hostPorts isn't necessary. On AWS, we want to expose Envoy as a service of type NodePort rather than using hostPorts. Therefore, we can remove the 'hostPort' fields from the Envoy manifest.
@iaguis dismissed stale reviews from invidian and rata via 8d44c85 on May 12, 2020 10:22
@iaguis force-pushed the johananl/fix-aws branch from fc2f475 to 8d44c85 on May 12, 2020 10:22
@iaguis (Contributor) commented May 12, 2020

I've rebased and fixed conflicts.

@iaguis requested a review from invidian on May 12, 2020 10:45
@rata (Member) left a comment

Sticky LGTM :-P

@iaguis merged commit 4d3d432 into master on May 12, 2020
@iaguis deleted the johananl/fix-aws branch on May 12, 2020 13:29
Development

Successfully merging this pull request may close these issues.

AWS: ingress not working by default
6 participants