This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Fix AWS worker pool handling #367

Merged: 10 commits into master from johananl/fix-aws, May 12, 2020

Conversation

@johananl (Member) commented Apr 27, 2020

General

Fixes #349.

This PR fixes the handling of worker pools on AWS. The following is a summary of the changes:

  • There used to be an "invisible" worker pool of size 0 that was created with every cluster. This pool has been removed.
  • NLB listeners are now created per worker pool and are correctly wired to the matching target groups (see the sketch below this list). Before this change there was one "global" set of listeners wired to the empty worker pool, effectively blackholing all ingress traffic to the cluster.
  • NLB listener ports for HTTP/HTTPS are now configurable, with defaults of 80/443 respectively. This is required in order to support an arbitrary number of worker pools with ingress using the same NLB.
  • Target group names have been cleaned up. The AWS API limits them to 32 characters, so we want to keep them short. Related issue: Handle AWS ELB target group names > 32 characters #366
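For illustration only, here is a minimal Terraform sketch of what a per-pool listener wired to its own target group could look like. The resource names, variable names, and port values are assumptions made for this example, not the PR's actual code:

resource "aws_lb_target_group" "pool_1_http" {
  # One target group per worker pool. The port is the NodePort on which the
  # pool's ingress controller is exposed (example value).
  vpc_id      = var.vpc_id
  protocol    = "TCP"
  port        = 30080
  target_type = "instance"
}

resource "aws_lb_listener" "pool_1_http" {
  # One listener per worker pool, forwarding to that pool's target group.
  load_balancer_arn = var.nlb_arn
  protocol          = "TCP"
  port              = 80   # configurable, per the listener port change above

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.pool_1_http.arn
  }
}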

Testing

To test this PR, deploy a cluster with a config similar to the following:

cluster "aws" {
  asset_dir        = "./assets"
  cluster_name     = "my-cluster"
  controller_count = 1
  dns_zone         = "example.com"
  dns_zone_id      = "xxxxxxxx"
  ssh_pubkeys      = ["ssh-rsa ..."]

  worker_pool "pool-1" {
    instance_type = "t3a.small"
    count         = 2
    ssh_pubkeys   = ["ssh-rsa ..."]
  }
}

component "contour" {
  service_type = "NodePort"
}
component "cert-manager" {
  email = "me@example.com"
}
component "httpbin" {
  ingress_host = "httpbin.example.com"
}

httpbin should be available at https://httpbin.example.com.

@johananl (Member Author)

We're hitting the size limit of the ELB target group name (32 chars).

@johananl force-pushed the johananl/fix-aws branch 3 times, most recently from ecbad1a to 4dbc191 on April 27, 2020 18:39
@johananl (Member Author)

> We're hitting the size limit of the ELB target group name (32 chars).

Fixed.

@invidian (Member) left a comment

Just some small nits, otherwise LGTM. Nice work @johananl

@surajssd (Member)

> httpbin should be available at https://httpbin.example.com.

@johananl I was manually testing this PR, where should httpbin.example.com be pointing to in Route 53? This is a manual step, right?

@johananl (Member Author)

> httpbin should be available at https://httpbin.example.com.
>
> @johananl I was manually testing this PR, where should httpbin.example.com be pointing to in Route 53? This is a manual step, right?

Lokomotive creates a wildcard record which routes *.example.com to the NLB. You shouldn't need to touch Route 53 manually. My testing instructions should work exactly as written.
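
For illustration, the kind of wildcard record described above might look roughly like this in Terraform; the resource and variable names are assumptions, not the project's actual code:

resource "aws_route53_record" "ingress_wildcard" {
  # Routes *.<dns_zone> to the cluster's NLB via an alias record.
  zone_id = var.dns_zone_id
  name    = "*.${var.dns_zone}"
  type    = "A"

  alias {
    name                   = aws_lb.nlb.dns_name
    zone_id                = aws_lb.nlb.zone_id
    evaluate_target_health = false
  }
}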

@johananl force-pushed the johananl/fix-aws branch 3 times, most recently from e3c0e75 to ccbf8ca on April 28, 2020 10:52
@johananl requested review from surajssd and invidian on April 28, 2020 10:53
@invidian (Member)

> Lokomotive creates a wildcard record which routes *.example.com to the NLB. You shouldn't need to touch Route 53 manually. My testing instructions should work exactly as written.

@johananl if this should work fully automatically, perhaps we should add an e2e test for it which just does an HTTP request that should succeed?

@johananl (Member Author)

> Lokomotive creates a wildcard record which routes *.example.com to the NLB. You shouldn't need to touch Route 53 manually. My testing instructions should work exactly as written.
>
> @johananl if this should work fully automatically, perhaps we should add an e2e test for it which just does an HTTP request that should succeed?

I can look into that, although I wanted to keep the scope of this PR small and merge it ASAP as things are currently broken.

@surajssd (Member)

> I can look into that, although I wanted to keep the scope of this PR small and merge it ASAP as things are currently broken.

This works for me. The only difference from deploying on Packet is that on Packet we need to create a manual entry in Route 53, which lets the user choose whatever URL they want.

On AWS we need to make sure that the URL is <app-name>.<cluster-name>.<dns-zone>.

@johananl (Member Author)

> The only difference from deploying on Packet is that on Packet we need to create a manual entry in Route 53, which lets the user choose whatever URL they want.

Opened #370 to track this.

> On AWS we need to make sure that the URL is <app-name>.<cluster-name>.<dns-zone>.

Opened #371 to track this.

surajssd previously approved these changes Apr 28, 2020
@surajssd (Member) left a comment

Works for me.

@rata (Member) commented May 11, 2020

@johananl This LGTM. As I said in the previous reviews, I don't think I need to review again; the changes this PR might still receive are very small.

Regarding the open comment, as I mentioned, mark it resolved when you consider it resolved (AFAICS you ran into problems testing your solution; whatever you consider reasonable at this point is fine). Regarding the other comments, they are nit-picking, so whatever you decide (even leaving it as it is) is fine by me :)

If you need it, of course, just ping me for an LGTM :)

@johananl force-pushed the johananl/fix-aws branch 4 times, most recently from b70e252 to fc2f475 on May 11, 2020 19:12
@johananl requested review from rata and invidian on May 11, 2020 19:41
@johananl (Member Author)

All feedback was handled. Please check again.

rata previously approved these changes May 11, 2020
@rata (Member) left a comment

LGTM. Thanks a lot @johananl, this improvement is really important! :)

Review comment on assets/components/contour/values.yaml (resolved)
invidian previously approved these changes May 12, 2020
johananl added 10 commits May 12, 2020 12:19
It should be easy to figure out which cluster an AWS resource belongs to.

- Prefix EC2 `Name` tags with cluster name.
- Prefix ASG name with cluster name.
- Prefix ASG launch config with cluster and pool name.
- Prefix LB target groups with cluster and pool name.

Worker pools are managed in a dedicated module. Invoking the workers module from the controller modules disallows having an arbitrary number of worker pools.
NLB listeners are tightly coupled to the worker pool they serve: a listener can point at only one worker pool. Since we can have an arbitrary number of worker pools, we should create per-pool listeners (one for HTTP and another for HTTPS). Since we can't create two listeners with the same protocol and port, we parameterize the ports (with defaults of 80/443 for HTTP/HTTPS, respectively), as sketched below.
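
A minimal sketch of what such parameterized ports could look like in Terraform; the variable names are assumptions for illustration, not necessarily the module's actual inputs:

variable "lb_http_port" {
  description = "NLB listener port for HTTP ingress traffic."
  type        = number
  default     = 80
}

variable "lb_https_port" {
  description = "NLB listener port for HTTPS ingress traffic."
  type        = number
  default     = 443
}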
Bootkube shouldn't depend on workers, and in any case the `workers` module referenced isn't in use.

Target group name length is limited to 32 characters. We should keep the name as short as possible to accommodate clusters with long names, and in addition the word "workers" doesn't add any meaning here.
ELB target group names are limited to 32 characters. Since we easily hit this limit when trying to include the cluster and pool names in the target group name, we move this information to AWS tags and let Terraform generate a target group name automatically, roughly as sketched below.
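
For illustration only (the resource, variable, and tag names are assumptions rather than the PR's actual code), letting Terraform pick the name while keeping the cluster/pool information in tags could look like this:

resource "aws_lb_target_group" "pool_1_https" {
  # No `name` argument: Terraform assigns a random unique name, which always
  # fits within the 32-character limit. Cluster and pool names live in tags.
  vpc_id      = var.vpc_id
  protocol    = "TCP"
  port        = 30443   # example NodePort on the workers
  target_type = "instance"

  tags = {
    Cluster = var.cluster_name
    Pool    = var.pool_name
  }
}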
Worker pool names already include the cluster name.

In order to expose Envoy using a NodePort service, we need to wire the NLB to the workers using a port that's within the NodePort range. We also fix the health checks accordingly and remove an SG rule which belongs to nginx.
Exposing services using hostPorts is discouraged by the official k8s docs and has similar problems to using host networking.

On Packet, we expose Envoy as a service of type LoadBalancer, so using hostPorts isn't necessary. On AWS, we want to expose Envoy as a service of type NodePort rather than using hostPorts. Therefore, we can remove the 'hostPort' fields from the Envoy manifest.
@iaguis dismissed stale reviews from invidian and rata via 8d44c85 on May 12, 2020 10:22
@iaguis force-pushed the johananl/fix-aws branch from fc2f475 to 8d44c85 on May 12, 2020 10:22
@iaguis (Contributor) commented May 12, 2020

I've rebased and fixed conflicts.

@iaguis requested a review from invidian on May 12, 2020 10:45
@rata (Member) left a comment

Sticky LGTM :-P

@iaguis merged commit 4d3d432 into master on May 12, 2020
@iaguis deleted the johananl/fix-aws branch on May 12, 2020 13:29
Development

Successfully merging this pull request may close these issues.

AWS: ingress not working by default
6 participants