OCPVE-635: fix: allow multi-node readiness with master nodes with NoSchedule Taints #383

Conversation

@jakobmoellerdev (Contributor) commented Aug 16, 2023

Description of problem:

When starting a cluster with multiple nodes and attaching multiple devices to them, the Cluster does not become ready.

In this case, all worker nodes have 2 loop devices with 3 GB of block storage attached. All VolumeGroupNodeStatus objects show as Ready.

apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"lvm.topolvm.io/v1alpha1","kind":"LVMCluster","metadata":{"annotations":{},"name":"my-lvmcluster","namespace":"openshift-storage"},"spec":{"storage":{"deviceClasses":[{"default":true,"fstype":"xfs","name":"vg1","thinPoolConfig":{"name":"thin-pool-1","overprovisionRatio":10,"sizePercent":90}}]}}}
  creationTimestamp: "2023-08-16T08:24:37Z"
  finalizers:
  - lvmcluster.topolvm.io
  generation: 1
  name: my-lvmcluster
  namespace: openshift-storage
  resourceVersion: "46967"
  uid: fd5c50cd-c2d0-453d-828e-37dae45ddc38
spec:
  storage:
    deviceClasses:
    - default: true
      fstype: xfs
      name: vg1
      thinPoolConfig:
        name: thin-pool-1
        overprovisionRatio: 10
        sizePercent: 90
status:
  deviceClassStatuses:
  - name: vg1
    nodeStatus:
    - devices:
      - /dev/loop0
      - /dev/loop1
      node: ip-10-0-144-179.us-east-2.compute.internal
      status: Ready
    - devices:
      - /dev/loop0
      - /dev/loop1
      node: ip-10-0-168-219.us-east-2.compute.internal
      status: Ready
    - devices:
      - /dev/loop0
      - /dev/loop1
      node: ip-10-0-240-151.us-east-2.compute.internal
      status: Ready
  state: Progressing

Version-Release number of selected component (if applicable):

4.13-4.15

How reproducible:

100%

Steps to Reproduce:

  1. Create a multi-node cluster with 3 worker nodes.
  2. Attach loop device storage to them.
  3. Create an LVMCluster as seen above, with generic xfs formatting and a greedy lookup for devices (no deviceSelector).
  4. Observe that the Cluster does not become Ready even though all components are Ready.

Actual results:

Cluster does not get ready.
When injecting a log message into the readiness check, one can see it comes from the VG comparison:
{"level":"info","ts":"2023-08-16T08:24:40Z","logger":"lvmcluster-controller","msg":"Verifying readiness","Request.Name":"my-lvmcluster","Request.Namespace":"openshift-storage","expectedVGCount":6,"readyVGCount":3}

For some reason, the expectedVGCount is 6 while readyVGCount is only 3.
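
The PR title points at the likely cause: control-plane nodes carry a NoSchedule taint, so vg-manager never runs there, yet they still appear to be counted when the expected number of volume groups is computed (presumably 3 workers + 3 masters × 1 device class = 6, while only the 3 workers ever report a ready VG). The following is a minimal Go sketch of that counting idea only, not the actual diff in this PR; expectedVGCount and the sample node names are hypothetical stand-ins for the operator's real readiness check:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// expectedVGCount (hypothetical helper) counts one volume group per device
// class on every node that can actually run vg-manager. Nodes carrying a
// NoSchedule taint (e.g. masters) are skipped, because they never report a
// VolumeGroupNodeStatus and would otherwise inflate the expected count.
func expectedVGCount(nodes []corev1.Node, deviceClasses int) int {
	schedulable := 0
	for _, node := range nodes {
		noSchedule := false
		for _, taint := range node.Spec.Taints {
			if taint.Effect == corev1.TaintEffectNoSchedule {
				noSchedule = true
				break
			}
		}
		if !noSchedule {
			schedulable++
		}
	}
	return schedulable * deviceClasses
}

func main() {
	masterTaint := corev1.Taint{
		Key:    "node-role.kubernetes.io/master",
		Effect: corev1.TaintEffectNoSchedule,
	}
	nodes := []corev1.Node{
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-0"}},
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-1"}},
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-2"}},
		{ObjectMeta: metav1.ObjectMeta{Name: "master-0"}, Spec: corev1.NodeSpec{Taints: []corev1.Taint{masterTaint}}},
		{ObjectMeta: metav1.ObjectMeta{Name: "master-1"}, Spec: corev1.NodeSpec{Taints: []corev1.Taint{masterTaint}}},
		{ObjectMeta: metav1.ObjectMeta{Name: "master-2"}, Spec: corev1.NodeSpec{Taints: []corev1.Taint{masterTaint}}},
	}
	// Counting all 6 nodes reproduces expectedVGCount=6 from the log above;
	// skipping NoSchedule-tainted nodes yields 3, matching readyVGCount.
	fmt.Println(expectedVGCount(nodes, 1)) // 3
}

A fuller version would check the workload's tolerations against each taint rather than skipping every NoSchedule taint outright, but the counting principle is the same.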

Expected results:

Cluster becomes ready and VGCounts match.

Additional Notes:
This cannot be automatically tested without multi-node tests; once those exist, it should be covered automatically.

openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) on Aug 16, 2023
@openshift-ci-robot commented Aug 16, 2023

@jakobmoellerdev: This pull request references OCPVE-635 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


openshift-ci bot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Aug 16, 2023
@codecov-commenter

Codecov Report

Merging #383 (40d3c6d) into main (a962b90) will increase coverage by 40.31%.
Report is 18 commits behind head on main.
The diff coverage is 68.00%.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##             main     #383       +/-   ##
===========================================
+ Coverage   16.59%   56.91%   +40.31%     
===========================================
  Files          24       25        +1     
  Lines        2061     2091       +30     
===========================================
+ Hits          342     1190      +848     
+ Misses       1693      819      -874     
- Partials       26       82       +56     
Files Changed                                    Coverage Δ
controllers/lvmcluster_controller_watches.go     90.32% <ø> (+90.32%) ⬆️
pkg/vgmanager/vgmanager_controller.go             0.00% <0.00%> (ø)
controllers/topolvm_snapshotclass.go             61.22% <14.28%> (+61.22%) ⬆️
controllers/lvmcluster_controller.go             57.72% <28.57%> (+57.72%) ⬆️
pkg/cluster/leaderelection.go                    66.66% <66.66%> (ø)
pkg/cluster/sno.go                               72.72% <72.72%> (ø)
pkg/vgmanager/devices.go                         73.77% <81.25%> (-0.43%) ⬇️
controllers/topolvm_controller.go                94.14% <100.00%> (+94.14%) ⬆️

... and 7 files with indirect coverage changes

@jakobmoellerdev (Contributor, Author)

/hold still verifying fix

openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Aug 16, 2023
@jakobmoellerdev (Contributor, Author)

/unhold manually verified

@jakobmoellerdev (Contributor, Author)

/hold

@jakobmoellerdev (Contributor, Author)

/unhold

openshift-ci bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Aug 16, 2023
@suleymanakbas91 (Contributor)

/lgtm
/approve

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Aug 16, 2023
openshift-ci bot commented Aug 16, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jakobmoellerdev, suleymanakbas91

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Aug 16, 2023
openshift-ci bot commented Aug 16, 2023

@jakobmoellerdev: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-merge-robot merged commit edc65d7 into openshift:main on Aug 16, 2023
6 checks passed
@jakobmoellerdev (Contributor, Author)

/cherry-pick release-4.14

@openshift-cherrypick-robot

@jakobmoellerdev: new pull request created: #388

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
