amazonvpc is not working with Ubuntu 22.04(Jammy) #15720

Closed
h3poteto opened this issue Jul 30, 2023 · 15 comments · Fixed by #16313
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@h3poteto
Contributor

/kind bug

1. What kops version are you running? The command kops version, will display this information.

$ kops version
Client version: 1.27.0 (git-v1.27.0)

I tried the same thing with the master branch (a8fa895).

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"archive", BuildDate:"2023-06-15T08:14:06Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:14:49Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

$ kops create -f cluster.yaml
$ kops update cluster --name $CLUSTER_NAME --admin --yes

5. What happened after the commands executed?
A Kubernetes cluster is created, but some pods are not working, so the kops validate command fails.
For example, cert-manager and ebs-csi-node report errors.

$ k get pods -n kube-system 
NAME                                     READY   STATUS              RESTARTS        AGE
aws-cloud-controller-manager-hlxjj       1/1     Running             0               14m
aws-cloud-controller-manager-p2ggb       1/1     Running             0               14m
aws-cloud-controller-manager-p5nk6       1/1     Running             0               13m
aws-iam-authenticator-9zz6q              0/1     ContainerCreating   0               13m
aws-iam-authenticator-mnmgq              0/1     ContainerCreating   0               13m
aws-iam-authenticator-pnncn              0/1     ContainerCreating   0               13m
aws-node-6mbll                           1/1     Running             0               12m
aws-node-84kqw                           1/1     Running             0               14m
aws-node-npfd6                           1/1     Running             0               12m
aws-node-qmcjd                           1/1     Running             0               13m
aws-node-x5z6h                           1/1     Running             0               12m
aws-node-xmm5m                           1/1     Running             0               14m
cert-manager-85495b9754-jnrb9            1/1     Running             0               14m
cert-manager-cainjector-879f4679-tgt42   0/1     CrashLoopBackOff    6 (4m11s ago)   14m
cert-manager-webhook-5c5c9f4f95-wxpgh    0/1     CrashLoopBackOff    6 (4m20s ago)   14m
coredns-69998f855-9kkb8                  0/1     Pending             0               14m
coredns-autoscaler-fcf87bf56-hj4cc       0/1     Pending             0               14m
dns-controller-849b6b44c5-xdjl2          1/1     Running             0               14m
ebs-csi-controller-55847c479b-4dbxz      5/5     Running             0               14m
ebs-csi-controller-55847c479b-bhttn      5/5     Running             0               14m
ebs-csi-node-cpnb6                       2/3     CrashLoopBackOff    6 (4m58s ago)   14m
ebs-csi-node-hj55p                       2/3     CrashLoopBackOff    6 (3m29s ago)   12m
ebs-csi-node-p6hwc                       2/3     CrashLoopBackOff    6 (3m16s ago)   12m
ebs-csi-node-sb7sn                       2/3     CrashLoopBackOff    6 (3m32s ago)   12m
ebs-csi-node-wwkxn                       2/3     CrashLoopBackOff    6 (4m22s ago)   13m
ebs-csi-node-xcgjt                       2/3     CrashLoopBackOff    6 (4m57s ago)   14m
kops-controller-4r62d                    1/1     Running             0               14m
kops-controller-5jg69                    1/1     Running             0               13m
kops-controller-wxttx                    1/1     Running             0               14m
kube-apiserver-i-01a0c4dbe2535a577       2/2     Running             3 (15m ago)     13m
kube-apiserver-i-0cf34834c7d869a4c       2/2     Running             3 (15m ago)     13m
kube-apiserver-i-0e010cb5829cbd426       2/2     Running             4 (14m ago)     12m
pod-identity-webhook-7b4747876c-z2kgw    0/1     Pending             0               14m
pod-identity-webhook-7b4747876c-zwbl4    0/1     Pending             0               14m

ebs-csi-node:

$ k logs -f ebs-csi-node-hj55p --previous
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe
I0730 11:38:38.590223       1 node.go:91] regionFromSession Node service
I0730 11:38:38.590349       1 metadata.go:85] retrieving instance data from ec2 metadata
W0730 11:38:44.919462       1 metadata.go:88] ec2 metadata is not available
I0730 11:38:44.919486       1 metadata.go:96] retrieving instance data from kubernetes api
I0730 11:38:44.920033       1 metadata.go:101] kubernetes api is available
panic: error getting Node i-0d639425eec9d7b0b: Get "https://100.64.0.1:443/api/v1/nodes/i-0d639425eec9d7b0b": dial tcp 100.64.0.1:443: i/o timeout

goroutine 1 [running]:
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.newNodeService(0xc000638640)
        /go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/node.go:94 +0x345
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.NewDriver({0xc00023bf30, 0x8, 0x3684458?})
        /go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/driver.go:95 +0x393
main.main()
        /go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:46 +0x37d

cert-manager-webhook:

$ k logs -f cert-manager-webhook-5c5c9f4f95-wxpgh
W0730 11:43:28.030767       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
E0730 11:43:58.034144       1 webhook.go:123] "cert-manager: Failed initialising server" err="error building admission chain: Get \"https://100.64.0.1:443/api\": dial tcp 100.64.0.1:443: i/o timeout"

6. What did you expect to happen?
All pods work fine, and kops validate succeeds.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: playground.k8s.h3poteto.dev
spec:
  api:
    dns: {}
  authentication:
    aws: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://my-playground-store/playground.k8s.h3poteto.dev
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-ap-northeast-1a
      name: a
    - encryptedVolume: true
      instanceGroup: control-plane-ap-northeast-1c
      name: c
    - encryptedVolume: true
      instanceGroup: control-plane-ap-northeast-1d
      name: d
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-ap-northeast-1a
      name: a
    - encryptedVolume: true
      instanceGroup: control-plane-ap-northeast-1c
      name: c
    - encryptedVolume: true
      instanceGroup: control-plane-ap-northeast-1d
      name: d
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    maxPods: 50
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.27.4
  masterPublicName: api.playground.k8s.h3poteto.dev
  networkCIDR: 172.16.0.0/16
  networkID: vpc-00ea717e1640613ea
  networking:
    amazonvpc: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  serviceAccountIssuerDiscovery:
    discoveryStore: s3://my-irsa-store
    enableAWSOIDCProvider: true
  podIdentityWebhook:
    enabled: true
  certManager:
    enabled: true
    managed: true
  sshAccess: []
  subnets:
  - id: subnet-0619c5276e1edce32
    cidr: 172.16.0.0/20
    name: ap-northeast-1a
    type: Public
    zone: ap-northeast-1a
  - id: subnet-04acc221370b74258
    cidr: 172.16.16.0/20
    name: ap-northeast-1c
    type: Public
    zone: ap-northeast-1c
  - id: subnet-06d07ea961e7e0007
    cidr: 172.16.32.0/20
    name: ap-northeast-1d
    type: Public
    zone: ap-northeast-1d
  topology:
    dns:
      type: Public

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: playground.k8s.h3poteto.dev
  name: control-plane-ap-northeast-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20230711
  machineType: t3.small
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - ap-northeast-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: playground.k8s.h3poteto.dev
  name: control-plane-ap-northeast-1c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20230711
  machineType: t3.small
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - ap-northeast-1c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: playground.k8s.h3poteto.dev
  name: control-plane-ap-northeast-1d
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20230711
  machineType: t3.small
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - ap-northeast-1d

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: playground.k8s.h3poteto.dev
  name: nodes-ap-northeast-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20230711
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - ap-northeast-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: playground.k8s.h3poteto.dev
  name: nodes-ap-northeast-1c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20230711
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - ap-northeast-1c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: playground.k8s.h3poteto.dev
  name: nodes-ap-northeast-1d
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20230711
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - ap-northeast-1d

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?
If I specify cilium as the CNI in networking, it works fine (I also tried Cilium with ENI, and it works fine).
If I change the image to Ubuntu 20.04 (I tried 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211015), it works fine with amazonvpc.

In conclusion, I suspect the combination of amazonvpc and Ubuntu 22.04 is the problem.
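
For context, the crashing pods above are all timing out while reaching the in-cluster API service IP (100.64.0.1). If the cause is systemd-networkd on Jammy taking over the secondary ENIs and routes that the AWS VPC CNI configures (one theory for this class of failure, not confirmed in this issue), a possible mitigation is a networkd drop-in that leaves those interfaces unmanaged. This is only a sketch; the file name and the primary interface name (ens5) are assumptions and will vary by instance type and image, so the match may need tightening:

# /etc/systemd/network/99-vpc-cni-unmanaged.network  (hypothetical file name)
# Assumption: ens5 is the node's primary interface; everything else is a CNI-attached ENI.
[Match]
Name=!ens5

[Link]
# Tell systemd-networkd not to bring up or reconfigure the matched interfaces,
# so addresses and routes installed by the CNI are left alone.
Unmanaged=yes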

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 30, 2023
@hakman
Member

hakman commented Jul 30, 2023

@h3poteto See aws/amazon-vpc-cni-k8s#2103

@h3poteto
Contributor Author

Thank you, I got it.
If this issue does not need to be tracked in the kOps repository, please close it.

@hakman
Member

hakman commented Jul 31, 2023

Let's keep it open for some time.

@colt-1

colt-1 commented Oct 1, 2023

This still seems to be happening with: ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20230919

@hakman
Member

hakman commented Oct 2, 2023

@colathro this will continue happening until someone from AWS fixes aws/amazon-vpc-cni-k8s#2103.

@pmankad96

pmankad96 commented Oct 13, 2023

It also prevents a new kops cluster with networking=amazonvpc from coming up healthy. In my case the coredns-xx and ebs-csi-node pods kept crashing. For coredns the log read: plugin/error timeout when trying to connect to the Amazon-provided DNS server. For ebs-csi-node the error was about being unable to get the Node (it was trying 100.64. - not sure why). The workaround is to use a 20.04 image instead. The error messages are so cryptic that it took me a while to figure this out.

@btalbot

btalbot commented Oct 19, 2023

I ran into this as well while upgrading a test cluster from Kubernetes 1.26.5 to 1.27.6 using kops 1.28.

The error from the ebs-plugin container of an ebs-csi-node pod running on Ubuntu 22.04 is shown below. Reverting the node images to Ubuntu 20.04 (ubuntu-focal-20.04-amd64-server-20230502) allowed a rolling-restart with --cloudonly to cleanly restart the affected control-plane nodes.

+ kube-system ebs-csi-node-sjkv7 › ebs-plugin
kube-system ebs-csi-node-sjkv7 ebs-plugin I1018 23:46:11.891261       1 metadata.go:101] kubernetes api is available
kube-system ebs-csi-node-sjkv7 ebs-plugin panic: error getting Node i-04bddcf2fcb369bae: Get "https://100.64.0.1:443/api/v1/nodes/i-04bddcf2fcb369bae": dial tcp 100.64.0.1:443: i/o timeout
kube-system ebs-csi-node-sjkv7 ebs-plugin
kube-system ebs-csi-node-sjkv7 ebs-plugin goroutine 1 [running]:
kube-system ebs-csi-node-sjkv7 ebs-plugin github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.newNodeService(0xc00003f540)
kube-system ebs-csi-node-sjkv7 ebs-plugin 	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/node.go:94 +0x345
kube-system ebs-csi-node-sjkv7 ebs-plugin github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.NewDriver({0xc00054df30, 0x8, 0x3684458?})
kube-system ebs-csi-node-sjkv7 ebs-plugin 	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/driver.go:95 +0x393
kube-system ebs-csi-node-sjkv7 ebs-plugin main.main()
kube-system ebs-csi-node-sjkv7 ebs-plugin 	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:46 +0x37d
- kube-system ebs-csi-node-sjkv7 › ebs-plugin
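
For anyone doing the same revert, the rough sequence is below. The instance group name is taken from the manifest earlier in this issue and the focal AMI is the one h3poteto mentioned; substitute your own group names and a current 20.04 image.

# Point each affected instance group back at an Ubuntu 20.04 (focal) image, e.g.
#   spec.image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211015
kops edit ig nodes-ap-northeast-1a --name $CLUSTER_NAME
kops update cluster --name $CLUSTER_NAME --yes
# --cloudonly rolls the instances without validating against the Kubernetes API
kops rolling-update cluster --name $CLUSTER_NAME --cloudonly --yes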

@btalbot

btalbot commented Oct 19, 2023

Can't kops work around this issue by simply NOT updating to Ubuntu 22.04 for instances running in AWS? Seems silly to keep breaking everyone's clusters like this.

@hakman
Member

hakman commented Oct 19, 2023

Can't kops work around this issue by simply NOT updating to Ubuntu 22.04 for instances running in AWS? Seems silly to keep breaking everyone's clusters like this.

kOps is not just about clusters using the AWS VPC CNI. All other CNIs and components work fine with Ubuntu 22.04.
Ubuntu 22.04 is 1.5 years old and AWS has not added support for it, with no plan to do so in the near future.

It is probably a good idea to add something that locks clusters using the AWS VPC CNI to Ubuntu 20.04.
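
In the meantime, a user-side pin per instance group would look roughly like the following (reusing the focal AMI name quoted earlier in this thread; any current 20.04 image works the same way):

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: playground.k8s.h3poteto.dev
  name: nodes-ap-northeast-1a
spec:
  # Pin to Ubuntu 20.04 (focal) until the AWS VPC CNI supports 22.04
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211015
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - ap-northeast-1a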

@doryer

doryer commented Nov 1, 2023

Can't kops work around this issue by simply NOT updating to Ubuntu 22.04 for instances running in AWS? Seems silly to keep breaking everyone's clusters like this.

kOps is not just about clusters using the AWS VPC CNI. All other CNIs and components work fine with Ubuntu 22.04. Ubuntu 22.04 is 1.5 years old and AWS has not added support for it, with no plan to do so in the near future.

It is probably a good idea to add something that locks clusters using the AWS VPC CNI to Ubuntu 20.04.

It is worth marking it as unstable (https://github.com/kubernetes/kops/blob/master/docs/operations/images.md), as we tried to upgrade the Ubuntu version and ran into issues in the cluster.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2024
@h3poteto
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2024
@hakman
Member

hakman commented Jan 31, 2024

/cc @moshevayner

@moshevayner
Member

This is related to #16255
I'm working on a fix; hopefully I'll have a PR up in the next couple of days 🙏🏼🙏🏼

@moshevayner
Member

/assign
