Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix calico-node in etcd mode #10438

Merged
merged 2 commits into from
Dec 19, 2023

Conversation

olevitt
Copy link
Contributor

@olevitt olevitt commented Sep 14, 2023

What type of PR is this?
/kind bug

What this PR does / why we need it:

This PR fixes a regression introduced by #10177 which occurs when running in calico etcd mode.
More details on #10436.

Which issue(s) this PR fixes:
Fixes #10436

Special notes for your reviewer:
This PR fixes a bug that is critical for users running calico in etcd mode. It may be useful to add this to the know issues for v2.23.0

Does this PR introduce a user-facing change?:

Fix calico node when running in etcd mode

Thanks for maintaining this great project 😍

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Sep 14, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 14, 2023
@k8s-ci-robot k8s-ci-robot requested a review from bozzo September 14, 2023 14:34
@k8s-ci-robot
Copy link
Contributor

Hi @olevitt. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 14, 2023
@mzaian
Copy link
Contributor

mzaian commented Sep 14, 2023

Hello @olevitt

Thanks for your PR. I would suggest to use the default PR template also to have the bug note automatically generated for the next release.

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 14, 2023
@olevitt
Copy link
Contributor Author

olevitt commented Sep 18, 2023

@mzaian Thanks ! I changed the PR text to be in line with the default template. Feel free to amend anything you think may be improved.

@floryut floryut added the kind/bug Categorizes issue or PR as related to a bug. label Sep 18, 2023
@yankay yankay requested a review from cyclinder September 20, 2023 10:13
@cyclinder
Copy link
Contributor

cyclinder commented Sep 20, 2023

I'm curious why we haven't run into this issue before. Actually, this PR is just moving the config file to configMap. No matter which mode Calico is running in, we need KUBERNETES_NODE_NAME, I believe there was the same issue before, your fix makes sense, thanks.

updated:

refer to https://github.com/projectcalico/calico/blob/master/manifests/calico-etcd.yaml, In etcd mode, we should configure etcd_endpoints, it look like we don't need KUBERNETES_NODE_NAME, I'm curious if your cluster is normal before it is upgraded, can you show the contents of the /etc/cni/net.d/10-calico.conflist?

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 20, 2023
@@ -95,13 +95,11 @@ spec:
# Prevents the container from sleeping forever.
- name: SLEEP
value: "false"
{% if calico_datastore == "kdd" %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert this changes and add follwing env:

           # The location of the etcd cluster.
            - name: ETCD_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: etcd_endpoints

Copy link
Contributor

@cyclinder cyclinder Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can test it in your cluster and see if works,

"nodename": "{{ calico_kubelet_name.stdout }}",
{% else %}
"nodename": "{{ calico_baremetal_nodename }}",
{% endif %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert the changes, too

@cyclinder
Copy link
Contributor

ping @olevitt, Can you please respond to me?

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2023
@olevitt
Copy link
Contributor Author

olevitt commented Oct 12, 2023

@cyclinder sorry for the delay.
My understanding on this is that the issue is all about having nodename properly set on each node which was not the case for me after upgrading and I think that's due to #10177 using {{ calico_baremetal_nodename }} resulting in the same name (first controlplane) on all nodes.
ETCD_ENDPOINTS looks fine and are, to my understanding, not related to this bug.

Here is my /etc/cni/net.d/10-calico.conflist (on a random node), after applying the fix in this PR (before applying the fix I had nodename set as the first controlplane).

root@boss4:~# cat /etc/cni/net.d/10-calico.conflist
{
  "name": "k8s-pod-network",
  "cniVersion":"0.3.1",
  "plugins":[
    {
                            "nodename": "boss4",
                            "type": "calico",
        "log_level": "info",
                  "log_file_path": "/var/log/calico/cni/cni.log",
                            "etcd_endpoints": "https://192.168.254.134:2379,https://192.168.254.135:2379,https://192.168.254.136:2379",
        "etcd_cert_file": "/etc/calico/certs/cert.crt",
        "etcd_key_file": "/etc/calico/certs/key.pem",
        "etcd_ca_cert_file": "/etc/calico/certs/ca_cert.crt",
                            "ipam": {
          "type": "calico-ipam",
                        "assign_ipv4": "true"
        },
                                                "policy": {
          "type": "k8s"
        },
                            "kubernetes": {
          "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
        }
    },
    {
      "type":"portmap",
      "capabilities": {
        "portMappings": true
      }
    },
    {
      "type":"bandwidth",
      "capabilities": {
        "bandwidth": true
      }
    }
  ]
}

@cyclinder
Copy link
Contributor

thanks for the clarification @olevitt.

calico_datastore Is it kdd on your cluster? I think there was the same issue before. I think your change applies to all cases. But look at https://github.com/projectcalico/calico/blob/master/manifests/calico-etcd.yaml, It seems we don't need K8S_NODE_NAME in calico etcd mode, I hope our calico manifests to be as consistent as possible with calico upstream.

@olevitt
Copy link
Contributor Author

olevitt commented Oct 13, 2023

@cyclinder calico_datastore is etcd in my cluster and have been since it's creation (almost 3 years ago).
I don't have enough knowledge on calico to say but having a wrong nodename in the configuration definitely make it assume that node configuration and go crazy. That's the bug I encountered when upgrading. Maybe nodename is optional in etcd mode but is used when present ? I may test what happens when no nodename is set in etcd mode

@cyclinder
Copy link
Contributor

Maybe nodename is optional in etcd mode but is used when present ?

I think calico doesn't use nodeName in etcd mode, At least, https://github.com/projectcalico/calico/blob/master/manifests/calico-etcd.yaml don't set it.

I may test what happens when no nodename is set in etcd mode

thanks! Can you add ETCD_ENDPONINTS ENV to your calico-node? just like https://github.com/projectcalico/calico/blob/master/manifests/calico-etcd.yaml

@floryut
Copy link
Member

floryut commented Nov 9, 2023

Maybe nodename is optional in etcd mode but is used when present ?

I think calico doesn't use nodeName in etcd mode, At least, https://github.com/projectcalico/calico/blob/master/manifests/calico-etcd.yaml don't set it.

I may test what happens when no nodename is set in etcd mode

thanks! Can you add ETCD_ENDPONINTS ENV to your calico-node? just like https://github.com/projectcalico/calico/blob/master/manifests/calico-etcd.yaml

@olevitt Can you check @cyclinder comments ?

As soon as it's ok the hold flag can be removed 👍

@cyclinder
Copy link
Contributor

friendly ping ^^^ @olevitt

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Nov 10, 2023
@olevitt
Copy link
Contributor Author

olevitt commented Nov 10, 2023

Once again sorry for the delay. I got to test and it seems that, as you said, removing nodename and adding ETCD_ENDPONINTS ENV to the install-cni init container works, at least on my setup. I will rewrite this PR according to your suggestions

@cyclinder Done.
Not sure if I should add other ETCD env variables such as ETCD_CA_CERT_FILE ?

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 10, 2023
@cyclinder
Copy link
Contributor

cyclinder commented Nov 12, 2023

Thanks @olevitt

Not sure if I should add other ETCD env variables such as ETCD_CA_CERT_FILE?

Yes, But we already added these variables. Your changes look good now, Thanks for your work👍

/hold cancel
/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 12, 2023
@VannTen
Copy link
Contributor

VannTen commented Nov 25, 2023

Chiming in, as our clusters also use calico in etcd mode.

Maybe we could add that deployment mode to the CI test, wdyt @floryut
That kind of breakage seems serious enough, and etcd mode used to be the default, so I suppose there is a number of existing clusters still around.

@floryut
Copy link
Member

floryut commented Nov 28, 2023

Chiming in, as our clusters also use calico in etcd mode.

Maybe we could add that deployment mode to the CI test, wdyt @floryut That kind of breakage seems serious enough, and etcd mode used to be the default, so I suppose there is a number of existing clusters still around.

Never hurt to add more cases to the CI tbh 👍

@VannTen
Copy link
Contributor

VannTen commented Dec 14, 2023

Once we confirm that the test in the linked PR catch the problem, could you add it to your PR (by rebasing or cherry-picking, whatever). That way we can be sure it does fix the problem and we'll catch future stuff.

@olevitt
Copy link
Contributor Author

olevitt commented Dec 14, 2023

Sure, ping me when / if I need to rebase this PR.
Btw, maintainers have permissions to edit / commit to olevitt/kubespray. If anyone wants / needs to do so, feel free to.

@VannTen
Copy link
Contributor

VannTen commented Dec 14, 2023 via email

@yankay
Copy link
Member

yankay commented Dec 19, 2023

Thanks @olevitt @VannTen @cyclinder 👍

merge the PR first so the test case can be added later :-)

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cyclinder, olevitt, yankay

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 19, 2023
@k8s-ci-robot k8s-ci-robot merged commit 29ea790 into kubernetes-sigs:master Dec 19, 2023
61 checks passed
@jonathansloman
Copy link

Hi - will this fix also be added to a 2.23.2 release? The issue is preventing us from upgrading our cluster, and kubespray documentation specifies not to skip releases when upgrading (ie, we shouldn't go from 2.22 directly to 2.24), so we need a working 2.23 to give us an upgrade path.

Thank you.

@VannTen
Copy link
Contributor

VannTen commented Jan 2, 2024

I plan to backport it to release-2.23 along with the other necessary PRs (#10725 #10722 ) because my client also needs it, however I'm quite busy and not yet back to work so if you can (backport it) go for it. I don't think there is any complicated conflicts to resolve.

VannTen pushed a commit to VannTen/kubespray that referenced this pull request Jan 4, 2024
* Calico : add ETCD endpoints to install-cni container

* Calico : remove nodename from configmap in etcd mode
VannTen pushed a commit to VannTen/kubespray that referenced this pull request Jan 4, 2024
* Calico : add ETCD endpoints to install-cni container

* Calico : remove nodename from configmap in etcd mode
VannTen added a commit to VannTen/kubespray that referenced this pull request Jan 11, 2024
This reverts commit 29ea790.

Check that CI still catches that bug with node-etcd-client
VannTen pushed a commit to VannTen/kubespray that referenced this pull request Jan 11, 2024
* Calico : add ETCD endpoints to install-cni container

* Calico : remove nodename from configmap in etcd mode
k8s-ci-robot pushed a commit that referenced this pull request Jan 12, 2024
* CI: Document the 'all-in-one' layout + small refactoring (#10725)

* Rename aio to all-in-one and document it

ADTM.
Acronyms don't tell much.

* Refactor vm_count in tests provisioning

* Add test case for calico using etcd datastore (#10722)

* Add multinode ci layout

* Add test case for calico using etcd datastore

* Fix calico-node in etcd mode (#10438)

* Calico : add ETCD endpoints to install-cni container

* Calico : remove nodename from configmap in etcd mode

---------

Co-authored-by: Olivier Levitt <olivier.levitt@gmail.com>
pedro-peter pushed a commit to pedro-peter/kubespray that referenced this pull request May 8, 2024
* Calico : add ETCD endpoints to install-cni container

* Calico : remove nodename from configmap in etcd mode
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Calico + etcd mayhem when upgrading to v2.23.0
8 participants