Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart poller after all CES are reconciled after nephe restart. #269

Merged
merged 1 commit into from
Jul 22, 2023

Conversation

archanapholla
Copy link
Contributor

No description provided.

@archanapholla archanapholla changed the title [DO NOT REVIEW] Improve UT coverage for azure plugin. Restart poller after all CES are reconciled after nephe restart. Jul 6, 2023
@archanapholla archanapholla changed the title Restart poller after all CES are reconciled after nephe restart. [DO NOT REVIEW] Restart poller after all CES are reconciled after nephe restart. Jul 6, 2023
@archanapholla archanapholla changed the title [DO NOT REVIEW] Restart poller after all CES are reconciled after nephe restart. Restart poller after all CES are reconciled after nephe restart. Jul 7, 2023
if exists {
accPoller.restartPoller(accNamespacedName)
// wait for polling to complete after restart.
retError = accPoller.waitForPollDone(accNamespacedName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the error from accPoller.waitForPollDone(accNamespacedName) will be lost.
Can you simulate an error for this scenario and verify if the CES status is updated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error from "accPoller.waitForPollDone(accNamespacedName)" is not lost as we can have this kind of wait function in defer function.
However, I saw an issue setting return value of AddResourceFiltersToAccount() from defer function.
I had to return the address.
Also, I tested the negative scenario, we should be good.

@@ -159,31 +164,56 @@ func (a *AccountManager) RemoveAccount(namespacedName *types.NamespacedName) err
// restart poller once done.
func (a *AccountManager) AddResourceFiltersToAccount(accNamespacedName *types.NamespacedName,
selectorNamespacedName *types.NamespacedName, selector *crdv1alpha1.CloudEntitySelector, replay bool) (bool, error) {
var retError error = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to initialise it to nil.

@@ -61,6 +61,8 @@ type AccountManager struct {
Inventory inventory.Interface
accPollers map[types.NamespacedName]*accountPoller
accountConfigMap map[types.NamespacedName]*accountConfig
// accountToNumSelectors stores number of configured CloudEntitySelector CRs for a given account, required for handling nephe restart case.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: split comment into two lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"provider type not found", selectorNamespacedName, accNamespacedName))
retError = fmt.Errorf(fmt.Sprintf("failed to add or update selector %v, account %v: "+
"provider type not found", *selectorNamespacedName, *accNamespacedName))
return false, retError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use of creating a new error variable here vs directly returning the error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want the defer function to get executed too which can alter the return value.

@@ -83,10 +83,10 @@ func (v *CESMutator) Handle(_ context.Context, req admission.Request) admission.
referencedAccount := &v1alpha1.CloudProviderAccount{}
err = v.Client.Get(context.TODO(), *accountNameSpacedName, referencedAccount)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Change to if err = (); err != nil {}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -127,7 +127,7 @@ func (ec2Cfg *ec2ServiceConfig) getManagedVpcIDs() map[string]struct{} {
awsPluginLogger().V(4).Info("Cache snapshot nil", "type", providerType, "account", ec2Cfg.accountNamespacedName)
return vpcIDsCopy
}
vpcIDsSet := snapshot.(*ec2ResourcesCacheSnapshot).managedVpcIDs
vpcIDsSet := snapshot.(*ec2ResourcesCacheSnapshot).managedVpcIds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: vpcIdsSet, vpcIdsCopy.
If you are renaming the variables, please rename all the above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed all in azure_compute.go and aws_ec2.go .

@@ -69,11 +69,10 @@ func (r *CloudEntitySelectorReconciler) Reconcile(ctx context.Context, req ctrl.
return ctrl.Result{}, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: return ctrl.Result{}, r.processDelete(&req.NamespacedName) and remove L68

@@ -158,32 +164,57 @@ func (a *AccountManager) RemoveAccount(namespacedName *types.NamespacedName) err
// AddResourceFiltersToAccount add/update cloud plugin for a given account to include the new selector and
// restart poller once done.
func (a *AccountManager) AddResourceFiltersToAccount(accNamespacedName *types.NamespacedName,
selectorNamespacedName *types.NamespacedName, selector *crdv1alpha1.CloudEntitySelector, replay bool) (bool, error) {
selectorNamespacedName *types.NamespacedName, selector *crdv1alpha1.CloudEntitySelector, replay bool) (bool, *error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is error used as pointer.. Consider changing it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without it being a pointer, we cannot set return value in defer function. If it is a regular variable then the retError set in defer function doesn't get returned to the caller. As a side-effect status setting in CES wouldn't work.

@@ -102,11 +106,13 @@ func (a *AccountManager) AddAccount(namespacedName *types.NamespacedName, accoun
// Create an account poller for polling cloud inventory.
accPoller, exists := a.addAccountPoller(cloudInterface, namespacedName, account)
if !exists {
if !crd.DoesCesCrExistsForAccount(a.Client, namespacedName) {
numCes := crd.GetNumberOfCesCrForAccount(a.Client, namespacedName)
Copy link
Contributor

@reachjainrahul reachjainrahul Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetCesCrForAccount

@@ -138,13 +136,13 @@ func (r *CloudEntitySelectorReconciler) processCreateOrUpdate(selector *crdv1alp
Name: selector.Spec.AccountName,
}
r.selectorToAccountMap[*selectorNamespacedName] = *accountNamespacedName
retry, err := r.AccManager.AddResourceFiltersToAccount(accountNamespacedName, selectorNamespacedName,
// No errors can be retried for a CES.
_, err := r.AccManager.AddResourceFiltersToAccount(accountNamespacedName, selectorNamespacedName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Anandkumar26 Can you confirm this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The error while processing CES can only occur if an account is either not initialised successfully or account is in a bad state. I don't think there is any benefit in retrying the errors forever.

So we capture the error and store in a CES status field. (which can be replayed when account goes from an uninitialised to initialised state OR it would need a user intervention)

@reachjainrahul
Copy link
Contributor

/nephe-test-e2e-agentless

@archanapholla archanapholla force-pushed the multi_ces_restart branch 4 times, most recently from 196f9d9 to a395391 Compare July 21, 2023 20:15
// Make sure reconciler retriable error cannot occur during restart, it can affect the logic in UpdatePendingCes function.
if count := r.AccManager.UpdatePendingCesCount(accountNamespacedName); count == 0 {
if err := r.AccManager.WaitForPollDone(accountNamespacedName); err != nil {
r.Log.Error(err, "Inventory poll failed", "account", accountNamespacedName, "error", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: inventory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -144,6 +143,12 @@ func (r *CloudEntitySelectorReconciler) processCreateOrUpdate(selector *crdv1alp
return err
}
r.updateStatus(selectorNamespacedName, err)
// Make sure reconciler retriable error cannot occur during restart, it can affect the logic in UpdatePendingCes function.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add detail comment on what it means..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a TODO as discussed

@reachjainrahul
Copy link
Contributor

/nephe-test-e2e-agentless-upgrade

- On nephe restart, In order to avoid poller restart on CPA and every CES add,
postpone restart till all CES attached to the CPA are reconciled.
- Fix logging in CES webhook.

Signed-off-by: Archana Holla <harchana@vmware.com>
@reachjainrahul
Copy link
Contributor

/nephe-test-e2e-agentless

1 similar comment
@reachjainrahul
Copy link
Contributor

/nephe-test-e2e-agentless

Copy link
Contributor

@reachjainrahul reachjainrahul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

@reachjainrahul reachjainrahul merged commit f384874 into main Jul 22, 2023
@reachjainrahul reachjainrahul deleted the multi_ces_restart branch July 22, 2023 05:41
@reachjainrahul reachjainrahul added this to the Nephe v0.6.0 release milestone Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants