OCPVE-635: fix: allow multi-node readiness with master nodes with NoSchedule Taints #383

Conversation

@jakobmoellerdev (Contributor) commented Aug 16, 2023

Description of problem:

When starting a cluster with multiple nodes and attaching multiple devices to them, the Cluster does not become ready.

In this case, all worker nodes have 2 loop devices with 3 GB of block storage attached. All VolumeGroupNodeStatus objects show as Ready.

apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"lvm.topolvm.io/v1alpha1","kind":"LVMCluster","metadata":{"annotations":{},"name":"my-lvmcluster","namespace":"openshift-storage"},"spec":{"storage":{"deviceClasses":[{"default":true,"fstype":"xfs","name":"vg1","thinPoolConfig":{"name":"thin-pool-1","overprovisionRatio":10,"sizePercent":90}}]}}}
  creationTimestamp: "2023-08-16T08:24:37Z"
  finalizers:
  - lvmcluster.topolvm.io
  generation: 1
  name: my-lvmcluster
  namespace: openshift-storage
  resourceVersion: "46967"
  uid: fd5c50cd-c2d0-453d-828e-37dae45ddc38
spec:
  storage:
    deviceClasses:
    - default: true
      fstype: xfs
      name: vg1
      thinPoolConfig:
        name: thin-pool-1
        overprovisionRatio: 10
        sizePercent: 90
status:
  deviceClassStatuses:
  - name: vg1
    nodeStatus:
    - devices:
      - /dev/loop0
      - /dev/loop1
      node: ip-10-0-144-179.us-east-2.compute.internal
      status: Ready
    - devices:
      - /dev/loop0
      - /dev/loop1
      node: ip-10-0-168-219.us-east-2.compute.internal
      status: Ready
    - devices:
      - /dev/loop0
      - /dev/loop1
      node: ip-10-0-240-151.us-east-2.compute.internal
      status: Ready
  state: Progressing

Version-Release number of selected component (if applicable):

4.13-4.15

How reproducible:

100%

Steps to Reproduce:

  1. Create a multi-node cluster with 3 worker nodes.
  2. Attach loop device storage to them.
  3. Create an LVMCluster as seen above, with generic xfs formatting and a greedy lookup for devices (no deviceSelector).
  4. Observe that the Cluster does not become Ready even though all components are Ready.

Actual results:

Cluster does not get ready.
When injecting a log message into the readiness check, one can see it comes from the VG comparison:
{"level":"info","ts":"2023-08-16T08:24:40Z","logger":"lvmcluster-controller","msg":"Verifying readiness","Request.Name":"my-lvmcluster","Request.Namespace":"openshift-storage","expectedVGCount":6,"readyVGCount":3}

For some reason, the expectedVGCount is 6 while readyVGCount is only 3.
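
The PR title points at the likely cause: control-plane nodes carry a NoSchedule taint, so vg-manager never runs there, yet they still appear to be counted when the expected number of volume groups is computed (presumably 3 workers + 3 masters × 1 device class = 6, while only the 3 workers ever report a ready VG). The following is a minimal Go sketch of that counting idea only, not the actual diff in this PR; expectedVGCount and the sample node names are hypothetical stand-ins for the operator's real readiness check:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// expectedVGCount (hypothetical helper) counts one volume group per device
// class on every node that can actually run vg-manager. Nodes carrying a
// NoSchedule taint (e.g. masters) are skipped, because they never report a
// VolumeGroupNodeStatus and would otherwise inflate the expected count.
func expectedVGCount(nodes []corev1.Node, deviceClasses int) int {
	schedulable := 0
	for _, node := range nodes {
		noSchedule := false
		for _, taint := range node.Spec.Taints {
			if taint.Effect == corev1.TaintEffectNoSchedule {
				noSchedule = true
				break
			}
		}
		if !noSchedule {
			schedulable++
		}
	}
	return schedulable * deviceClasses
}

func main() {
	masterTaint := corev1.Taint{
		Key:    "node-role.kubernetes.io/master",
		Effect: corev1.TaintEffectNoSchedule,
	}
	nodes := []corev1.Node{
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-0"}},
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-1"}},
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-2"}},
		{ObjectMeta: metav1.ObjectMeta{Name: "master-0"}, Spec: corev1.NodeSpec{Taints: []corev1.Taint{masterTaint}}},
		{ObjectMeta: metav1.ObjectMeta{Name: "master-1"}, Spec: corev1.NodeSpec{Taints: []corev1.Taint{masterTaint}}},
		{ObjectMeta: metav1.ObjectMeta{Name: "master-2"}, Spec: corev1.NodeSpec{Taints: []corev1.Taint{masterTaint}}},
	}
	// Counting all 6 nodes reproduces expectedVGCount=6 from the log above;
	// skipping NoSchedule-tainted nodes yields 3, matching readyVGCount.
	fmt.Println(expectedVGCount(nodes, 1)) // 3
}

A fuller version would check the workload's tolerations against each taint rather than skipping every NoSchedule taint outright, but the counting principle is the same.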

Expected results:

Cluster becomes ready and VGCounts match.

Additional Notes:
This cannot be automatically tested without multi-node tests; once those exist, it should be covered automatically.

openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) on Aug 16, 2023
@openshift-ci-robot commented Aug 16, 2023

@jakobmoellerdev: This pull request references OCPVE-635 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


openshift-ci bot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Aug 16, 2023
@codecov-commenter

Codecov Report

Merging #383 (40d3c6d) into main (a962b90) will increase coverage by 40.31%.
Report is 18 commits behind head on main.
The diff coverage is 68.00%.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##             main     #383       +/-   ##
===========================================
+ Coverage   16.59%   56.91%   +40.31%     
===========================================
  Files          24       25        +1     
  Lines        2061     2091       +30     
===========================================
+ Hits          342     1190      +848     
+ Misses       1693      819      -874     
- Partials       26       82       +56     
Files Changed                                    Coverage Δ
controllers/lvmcluster_controller_watches.go     90.32% <ø> (+90.32%) ⬆️
pkg/vgmanager/vgmanager_controller.go             0.00% <0.00%> (ø)
controllers/topolvm_snapshotclass.go             61.22% <14.28%> (+61.22%) ⬆️
controllers/lvmcluster_controller.go             57.72% <28.57%> (+57.72%) ⬆️
pkg/cluster/leaderelection.go                    66.66% <66.66%> (ø)
pkg/cluster/sno.go                               72.72% <72.72%> (ø)
pkg/vgmanager/devices.go                         73.77% <81.25%> (-0.43%) ⬇️
controllers/topolvm_controller.go                94.14% <100.00%> (+94.14%) ⬆️

... and 7 files with indirect coverage changes

@jakobmoellerdev (Contributor, Author)

/hold still verifying fix

openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Aug 16, 2023
@jakobmoellerdev (Contributor, Author)

/unhold manually verified

@jakobmoellerdev (Contributor, Author)

/hold

@jakobmoellerdev (Contributor, Author)

/unhold

openshift-ci bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Aug 16, 2023
@suleymanakbas91 (Contributor)

/lgtm
/approve

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Aug 16, 2023
openshift-ci bot commented Aug 16, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jakobmoellerdev, suleymanakbas91

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Aug 16, 2023
openshift-ci bot commented Aug 16, 2023

@jakobmoellerdev: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-merge-robot merged commit edc65d7 into openshift:main on Aug 16, 2023
6 checks passed
@jakobmoellerdev (Contributor, Author)

/cherry-pick release-4.14

@openshift-cherrypick-robot

@jakobmoellerdev: new pull request created: #388

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
