Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPVE-627: fix: concurrent apply / status checks -> LVMCluster #391

Conversation

jakobmoellerdev
Copy link
Contributor

@jakobmoellerdev jakobmoellerdev commented Aug 21, 2023

This PR introduces concurrent processing for the ensureCreated operations for each resourceManager in the LVMClusterReconciler controller, which significantly improves the overall reconciliation time for the LVMCluster custom resource.

The newly added multierror.go groups multiple errors into a single error, thereby helping to manage multiple errors generated from concurrent ensureCreated calls. This helps the concurrent apply since errors are not cancelled when one call fails, and multiple failures can occur at once.

Moreover, readiness checks have been added after the creation of Deployment and DaemonSet resources in the topolvm_controller.go, topolvm_node.go, and vgmanager.go files respectively.
These checks ensure that the deployment or DaemonSet has been completely initialized and is ready for use.
It also adds verifyDeploymentReadiness and verifyDaemonSetReadiness functions in utils.go to verify the readiness of Deployment and DaemonSet resources respectively.

Note: The checks currently do not get propagated to LVMCluster yet because the LVMCluster State is independant from the Readiness checks done during reconciliation. We will need an API Change to propagate this correctly (in a later change) as otherwise we would set LVMCluster to Degraded/Failed but couldnt show why (because the only status message fields in LVMCluster are for volume group failures, not any of the basic deployed components, and there is no vg present at the time of the readiness check)

@openshift-ci-robot
Copy link

openshift-ci-robot commented Aug 21, 2023

@jakobmoellerdev: This pull request references OCPVE-627 which is a valid jira issue.

In response to this:

This PR introduces concurrent processing for the ensureCreated operations for each resourceManager in the LVMClusterReconciler controller, which significantly improves the overall reconciliation time for the LVMCluster custom resource.

The newly added multierror.go groups multiple errors into a single error, thereby helping to manage multiple errors generated from concurrent ensureCreated calls. This helps the concurrent apply since errors are not cancelled when one call fails, and multiple failures can occur at once.

Moreover, readiness checks have been added after the creation of Deployment and DaemonSet resources in the topolvm_controller.go, topolvm_node.go, and vgmanager.go files respectively.
These checks ensure that the deployment or DaemonSet has been completely initialized and is ready for use.
It also adds verifyDeploymentReadiness and verifyDaemonSetReadiness functions in utils.go to verify the readiness of Deployment and DaemonSet resources respectively.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 21, 2023
@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 21, 2023
@openshift-ci-robot
Copy link

openshift-ci-robot commented Aug 21, 2023

@jakobmoellerdev: This pull request references OCPVE-627 which is a valid jira issue.

In response to this:

This PR introduces concurrent processing for the ensureCreated operations for each resourceManager in the LVMClusterReconciler controller, which significantly improves the overall reconciliation time for the LVMCluster custom resource.

The newly added multierror.go groups multiple errors into a single error, thereby helping to manage multiple errors generated from concurrent ensureCreated calls. This helps the concurrent apply since errors are not cancelled when one call fails, and multiple failures can occur at once.

Moreover, readiness checks have been added after the creation of Deployment and DaemonSet resources in the topolvm_controller.go, topolvm_node.go, and vgmanager.go files respectively.
These checks ensure that the deployment or DaemonSet has been completely initialized and is ready for use.
It also adds verifyDeploymentReadiness and verifyDaemonSetReadiness functions in utils.go to verify the readiness of Deployment and DaemonSet resources respectively.

TODO:

  • add test adjustments because envtest does not have a running controller for deployments and we will need to mock the readiness of the created resources

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


var errs []error
for i := 0; i < len(resources); i++ {
if err := <-results; err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if something goes wrong and we will stuck there forever?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the code path there should not be a way to get stuck here. This will always return exactly the amount of times that a resource is attempted to be created. This means that there will always be the correct amount of results, no matter the amount of failures in the goroutines.

return ctrl.Result{}, err
resourceSyncStart := time.Now()
results := make(chan error, len(resources))
create := func(i int) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errgroup cancel the context on the first error. We dont want the creation of other resources to abort in case of an error on one of the resources.

controllers/lvmcluster_controller.go Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

Codecov Report

Merging #391 (8c5e2bc) into main (a962b90) will increase coverage by 39.78%.
Report is 28 commits behind head on main.
The diff coverage is 63.98%.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##             main     #391       +/-   ##
===========================================
+ Coverage   16.59%   56.37%   +39.78%     
===========================================
  Files          24       26        +2     
  Lines        2061     2290      +229     
===========================================
+ Hits          342     1291      +949     
+ Misses       1693      904      -789     
- Partials       26       95       +69     
Files Changed Coverage Δ
controllers/lvmcluster_controller_watches.go 90.32% <ø> (+90.32%) ⬆️
controllers/topolvm_node.go 91.76% <0.00%> (+91.76%) ⬆️
controllers/vgmanager.go 68.00% <0.00%> (+29.70%) ⬆️
pkg/vgmanager/lvm.go 52.03% <ø> (+4.06%) ⬆️
controllers/topolvm_snapshotclass.go 61.22% <14.28%> (+61.22%) ⬆️
controllers/utils.go 45.09% <34.61%> (+5.09%) ⬆️
pkg/vgmanager/vgmanager_controller.go 10.40% <43.10%> (+10.40%) ⬆️
pkg/cluster/leaderelection.go 66.66% <66.66%> (ø)
controllers/lvmcluster_controller.go 60.00% <71.42%> (+60.00%) ⬆️
pkg/cluster/sno.go 72.72% <72.72%> (ø)
... and 3 more

... and 5 files with indirect coverage changes

@jakobmoellerdev
Copy link
Contributor Author

/test ci-index-lvm-operator-bundle
/test images

@qJkee
Copy link
Contributor

qJkee commented Aug 22, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 22, 2023
@openshift-ci-robot
Copy link

openshift-ci-robot commented Aug 22, 2023

@jakobmoellerdev: This pull request references OCPVE-627 which is a valid jira issue.

In response to this:

This PR introduces concurrent processing for the ensureCreated operations for each resourceManager in the LVMClusterReconciler controller, which significantly improves the overall reconciliation time for the LVMCluster custom resource.

The newly added multierror.go groups multiple errors into a single error, thereby helping to manage multiple errors generated from concurrent ensureCreated calls. This helps the concurrent apply since errors are not cancelled when one call fails, and multiple failures can occur at once.

Moreover, readiness checks have been added after the creation of Deployment and DaemonSet resources in the topolvm_controller.go, topolvm_node.go, and vgmanager.go files respectively.
These checks ensure that the deployment or DaemonSet has been completely initialized and is ready for use.
It also adds verifyDeploymentReadiness and verifyDaemonSetReadiness functions in utils.go to verify the readiness of Deployment and DaemonSet resources respectively.

Note: The checks currently do not get propagated to LVMCluster yet because the LVMCluster State is independant from the Readiness checks done during reconciliation. We will need an API Change to propagate this correctly (in a later change) as otherwise we would set LVMCluster to Degraded/Failed but couldnt show why (because the only status message fields in LVMCluster are for volume group failures, not any of the basic deployed components, and there is no vg present at the time of the readiness check)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

controllers/topolvm_node.go Outdated Show resolved Hide resolved
controllers/vgmanager.go Outdated Show resolved Hide resolved
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2023
@jakobmoellerdev jakobmoellerdev force-pushed the OCPVE-627-concurrent-resources branch 2 times, most recently from 03b69bb to 45ba774 Compare August 23, 2023 08:29
@jakobmoellerdev
Copy link
Contributor Author

/test lvm-operator-e2e-aws

@jakobmoellerdev
Copy link
Contributor Author

/test lvm-operator-e2e-aws

@suleymanakbas91
Copy link
Contributor

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 23, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jakobmoellerdev, suleymanakbas91

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 23, 2023
@jakobmoellerdev
Copy link
Contributor Author

/test lvm-operator-e2e-aws

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 23, 2023

@jakobmoellerdev: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 556d16c into openshift:main Aug 23, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants