
Disable DSPO Debug Logs by Default and modify the "DSPA resource was not found" message to Debug #678

Closed · wants to merge 2 commits

Conversation


@amadhusu amadhusu commented Jul 25, 2024

The issue resolved by this Pull Request:

Resolves RHOAIENG-1650

Description of your changes:

When deleting a DSPA, we continue to see logs like these:

2023-08-23T17:17:40Z DEBUG Reconcile event triggered by [Pod: mariadb-pipelines-definition-6d848d8d75-hg6hq] {"namespace": "my-ds-project"}
2023-08-23T17:17:40Z DEBUG Reconcile event triggered by [Pod: mariadb-pipelines-definition-6d848d8d75-hg6hq] {"namespace": "my-ds-project"}
2023-08-23T17:17:40Z DEBUG DataSciencePipelinesApplication Reconciler called. {"namespace": "my-ds-project", "dspa_name": "pipelines-definition"}
2023-08-23T17:17:40Z INFO DSPA resource was not found {"namespace": "my-ds-project", "dspa_name": "pipelines-definition"}
2023-08-23T17:17:40Z DEBUG Reconcile event triggered by [Pod: mariadb-pipelines-definition-6d848d8d75-hg6hq] {"namespace": "my-ds-project"}
2023-08-23T17:17:40Z DEBUG DataSciencePipelinesApplication Reconciler called. {"namespace": "my-ds-project", "dspa_name": "pipelines-definition"}
2023-08-23T17:17:40Z INFO DSPA resource was not found {"namespace": "my-ds-project", "dspa_name": "pipelines-definition"}
2023-08-23T17:17:40Z DEBUG DataSciencePipelinesApplication Reconciler called. {"namespace": "my-ds-project", "dspa_name": "pipelines-definition"}
2023-08-23T17:17:40Z INFO DSPA resource was not found {"namespace": "my-ds-project", "dspa_name": "pipelines-definition"}

This PR resolves this in two commits: the first disables DSPO debug logs by default, and the second changes the "DSPA resource was not found" message from Info to Debug.
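
For context on the first commit: controller-runtime operators typically configure logging through zap options, so a minimal sketch of defaulting the log level to Info (suppressing Debug lines) might look like the following. This is an illustration of the idea, not the actual DSPO wiring.

package main

import (
	"go.uber.org/zap/zapcore"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// Default the operator logger to Info level so DEBUG lines such as
	// "Reconcile event triggered by ..." are suppressed unless a lower
	// level is explicitly requested.
	opts := zap.Options{
		Development: false,
		Level:       zapcore.InfoLevel,
	}
	ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))
}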

Testing instructions

  1. Deploy the DSPO created by GitHub
  2. Deploy DSPA
  3. (Optional) Deploy additional DSPA in another namespace
  4. Delete the original DSPA
  5. Observe the DSPO logs to ensure that "DEBUG DSPA resource was not found" does not appear

Checklist

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

@dsp-developers

A new image has been built to help with testing out this PR: quay.io/opendatahub/data-science-pipelines-operator:pr-678
An OCP cluster where you are logged in as cluster admin is required.

To use this image, run the following:

cd $(mktemp -d)
git clone git@github.com:opendatahub-io/data-science-pipelines-operator.git
cd data-science-pipelines-operator/
git fetch origin pull/678/head
git checkout -b pullrequest bd8b98f0fcf548b0adb3ae855cc78f6fe14a90e1
oc new-project opendatahub
make deploy IMG="quay.io/opendatahub/data-science-pipelines-operator:pr-678"

More instructions here on how to deploy and test a Data Science Pipelines Application.

@@ -178,6 +178,8 @@ const DefaultMaxConcurrentReconciles = 10

const DefaultRequeueTime = time.Second * 20

var DSPAdeleted = make(map[string]bool)
Contributor

I have a few questions:

  • Are we going to store all DSPA names forever?
  • How about concurrent deletion and creation using the same name? How does the code handle that?

Contributor Author

Are we going to store all DSPA names forever?

We store the name just so we can detect later whether a DSPA with the same name gets created, which basically tackles your second question.

How about concurrent deletion and creation using the same name? How does the code handle that?

Lines 185 and 186 in dspipeline_controller.go were added specifically to handle that condition.
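
As a side note on the concurrency question: the diff hunk above shows DefaultMaxConcurrentReconciles is 10, so a shared plain map would also need synchronized access. A purely illustrative guard, with hypothetical helper names:

package controllers

import "sync"

// Hypothetical synchronized wrapper around the map introduced in this diff;
// a plain map is not safe when reconciles run concurrently.
var (
	dspaDeletedMu sync.Mutex
	dspaDeleted   = make(map[string]bool)
)

func markDSPADeleted(key string) {
	dspaDeletedMu.Lock()
	defer dspaDeletedMu.Unlock()
	dspaDeleted[key] = true
}

func wasDSPADeleted(key string) bool {
	dspaDeletedMu.Lock()
	defer dspaDeletedMu.Unlock()
	return dspaDeleted[key]
}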

@diegolovison (Contributor) commented Jul 25, 2024

May we have a unit test for this change?

Contributor

I don't think this is the right approach. Storing this sort of information in memory is not scalable; even if we could store a lot of DSPA names, we would lose that information on a pod restart. Concurrency could become another issue, as Diego pointed out, if we scale the operator in the future.

The offending piece of code is probably here:

https://github.com/opendatahub-io/data-science-pipelines-operator/blob/main/controllers/dspipeline_controller.go#L531-L537

Essentially, we requeue after the CR has been deleted because of Pod changes. If we can find a way to not requeue after a CR is deleted, that would be the optimal approach. Have you considered this? If so, I'd be curious to know why we couldn't go this route.
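
For illustration, the early-exit shape being suggested could look roughly like this. The reconciler stub, Log field, and method name are assumptions for the sketch, not the actual DSPO code:

package controllers

import (
	"context"

	"github.com/go-logr/logr"
	apierrs "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	dspav1alpha1 "github.com/opendatahub-io/data-science-pipelines-operator/api/v1alpha1"
)

// Minimal stand-in for the real reconciler, just enough for the sketch.
type DSPAReconciler struct {
	client.Client
	Log logr.Logger
}

// Sketch: exit Reconcile quietly when the DSPA CR is already gone, so Pod
// events that arrive after CR deletion neither error nor requeue.
func (r *DSPAReconciler) reconcileEntry(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := r.Log.WithValues("namespace", req.Namespace, "dspa_name", req.Name)
	dspa := &dspav1alpha1.DataSciencePipelinesApplication{}
	if err := r.Get(ctx, req.NamespacedName, dspa); err != nil {
		if apierrs.IsNotFound(err) {
			log.V(1).Info("DSPA resource was not found, assuming it was recently deleted")
			return ctrl.Result{}, nil // no error, no requeue
		}
		return ctrl.Result{}, err
	}
	// ... normal reconciliation continues here ...
	return ctrl.Result{}, nil
}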

Contributor Author

@HumairAK Unfortunately, reading through the code, there is no such way: after the CR is deleted, SetupWithManager still watches the resources that belonged to it, as specified here (https://github.com/opendatahub-io/data-science-pipelines-operator/blob/main/controllers/dspipeline_controller.go#L507).

I tried the predicates mentioned in kubernetes-sigs/kubebuilder#1700 (comment) to filter out resources that are already deleted, as in the snippet below, but that doesn't help, as the Operator-SDK docs note:

This is an implementation of the controller that simply filters Delete events on Pods that have been confirmed deleted; the controller receives all Delete events that occur, and we may only care about resources that have not been completely deleted

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func ignoreDeletionPredicate() predicate.Predicate {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			// Ignore updates to CR status, in which case metadata.Generation does not change
			return e.ObjectOld.GetGeneration() != e.ObjectNew.GetGeneration()
		},
		DeleteFunc: func(e event.DeleteEvent) bool {
			// Evaluates to false if the object has been confirmed deleted.
			return !e.DeleteStateUnknown
		},
	}
}

func (r *DSPAReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&dspav1alpha1.DataSciencePipelinesApplication{}).
		Owns(&appsv1.Deployment{}).
		....
		....
		....
		WithEventFilter(ignoreDeletionPredicate()).
		Complete(r)
}

But reading through your comment: instead of keeping a separate in-memory variable for the flag, is there any harm in storing a bool flag in the DataSciencePipelinesApplication struct itself (https://github.com/opendatahub-io/data-science-pipelines-operator/blob/main/api/v1alpha1/dspipeline_types.go#L404) and leveraging that to stop reconciliation in dspipeline_controller.go?

Contributor

Well, this happens when the DSPA is deleted, so I don't think adding a bool to the DSPA is going to help.

Another thought I had was utilizing finalizers for this: have the DSPA only remove its finalizer once all Pods are cleaned up. But I don't think it makes sense to block DSPA deletion on that; it would have an unwanted impact on deletion times in the UI.

Since we already exit reconcile early, without reporting an error, when the DSPA is not found, I'm thinking we just convert the "DSPA not found" log to debug and disable debug logs by default (we should be doing this anyway).

I would like to hear others' thoughts on this approach though. @gregsheremeta or @gmfrasca any thoughts?

Contributor

Finalizers are the correct approach; a sketch of the pattern follows below.

1. kubectl delete is called on the DSPA; deletionTimestamp is set.
2. The UI treats any DSPA with a deletionTimestamp as deleted, so "Pipeline Server" stops being rendered almost immediately.
3. Resources are cleaned up.
4. Once the resources tracked by the finalizer are gone, the finalizer is removed, and the DSPA is actually deleted from the kube-apiserver.

I've never coded this myself, so I might have this a little imperfect, but this is roughly the pattern I've seen in production kube systems I've worked on.
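
A rough sketch of that pattern with controller-runtime's controllerutil helpers, reusing the reconciler shape from the earlier sketch; the finalizer name and the cleanup helper are hypothetical, and this is the generic pattern rather than code from this PR:

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	dspav1alpha1 "github.com/opendatahub-io/data-science-pipelines-operator/api/v1alpha1"
)

// hypothetical finalizer name
const dspaFinalizer = "datasciencepipelinesapplications.opendatahub.io/finalizer"

func (r *DSPAReconciler) handleFinalizer(ctx context.Context, dspa *dspav1alpha1.DataSciencePipelinesApplication) (ctrl.Result, error) {
	if dspa.ObjectMeta.DeletionTimestamp.IsZero() {
		// Not being deleted: make sure our finalizer is present so deletion
		// waits for cleanup.
		if !controllerutil.ContainsFinalizer(dspa, dspaFinalizer) {
			controllerutil.AddFinalizer(dspa, dspaFinalizer)
			return ctrl.Result{}, r.Update(ctx, dspa)
		}
		return ctrl.Result{}, nil
	}
	// deletionTimestamp is set: clean up, then release the finalizer so the
	// apiserver can actually remove the object.
	if controllerutil.ContainsFinalizer(dspa, dspaFinalizer) {
		if err := r.cleanUpOwnedResources(ctx, dspa); err != nil { // hypothetical helper
			return ctrl.Result{}, err
		}
		controllerutil.RemoveFinalizer(dspa, dspaFinalizer)
		return ctrl.Result{}, r.Update(ctx, dspa)
	}
	return ctrl.Result{}, nil
}

The "constantly getting stuck" caveat in the next comment is exactly this pattern's failure mode: if cleanup keeps erroring, the finalizer never comes off and the object never goes away.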

Contributor

this is roughly the pattern I've seen in production kube systems I've worked on

I do concede, though, that this was quite the nightmare, because things were constantly getting stuck.

Contributor

Since we already exit reconcile early, without reporting an error, when the DSPA is not found, I'm thinking we just convert the "DSPA not found" log to debug and disable debug logs by default (we should be doing this anyway).

Ok, yeah, I'm sold. I guess this drastically changes this PR -- sorry Achyut :)

So line 180 below changes to roughly:

log.Debug("DSPA resource was not found, assuming it was recently deleted, nothing to do here")

and then we disable DSPO debug logging by default (probably in a different PR 😄)

@dsp-developers

Change to PR detected. A new PR build was completed.
A new image has been built to help with testing out this PR: quay.io/opendatahub/data-science-pipelines-operator:pr-678

@amadhusu changed the title from "Stop DSPO reconcilation after DSPA deletion" to "Ensure DSPO Logging Default to Debug and modify the "DSPA resource was not found" message to Debug" on Jul 30, 2024
Achyut Madhusudan added 2 commits July 31, 2024 19:42
Signed-off-by: Achyut Madhusudan <amadhusu@redhat.com>
Signed-off-by: Achyut Madhusudan <amadhusu@redhat.com>

@amadhusu changed the title from "Ensure DSPO Logging Default to Debug and modify the "DSPA resource was not found" message to Debug" to "Disable DSPO Debug Logging by Default and modify the "DSPA resource was not found" message to Debug" on Jul 31, 2024
@amadhusu changed the title from "Disable DSPO Debug Logging by Default and modify the "DSPA resource was not found" message to Debug" to "Disable DSPO Debug Logs by Default and modify the "DSPA resource was not found" message to Debug" on Jul 31, 2024

openshift-ci bot commented Jul 31, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: diegolovison
Once this PR has been reviewed and has the lgtm label, please assign accorvin for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@diegolovison

/lgtm

@VaniHaripriya

/verified
Tested as per the instructions and I could see only the Info logs. Attaching the logs for reference (relevant entries from line 950): data-science-pipelines-operator-controller-manager-d86cdf7tfpbx-manager.log

@HumairAK

HumairAK commented Aug 7, 2024

/hold

@gregsheremeta

gonna split this up into separate tasks

@gregsheremeta

/close

@openshift-ci openshift-ci bot closed this Aug 7, 2024

openshift-ci bot commented Aug 7, 2024

@gregsheremeta: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
