
Disable DSPO Debug Logs by Default and modify the "DSPA resource was not found" message to Debug #678

Closed · wants to merge 2 commits

Conversation


@amadhusu amadhusu commented Jul 25, 2024

The issue resolved by this Pull Request:

Resolves RHOAIENG-1650

Description of your changes:

When deleting a DSPA, we continue to see logs like these:

2023-08-23T17:17:40Z DEBUG Reconcile event triggered by [Pod: mariadb-pipelines-definition-6d848d8d75-hg6hq] {"namespace": "my-ds-project"}
2023-08-23T17:17:40Z DEBUG Reconcile event triggered by [Pod: mariadb-pipelines-definition-6d848d8d75-hg6hq] {"namespace": "my-ds-project"}
2023-08-23T17:17:40Z DEBUG DataSciencePipelinesApplication Reconciler called. {"namespace": "my-ds-project", "dspa_name": "pipelines-definition"}
2023-08-23T17:17:40Z INFO DSPA resource was not found {"namespace": "my-ds-project", "dspa_name": "pipelines-definition"}
2023-08-23T17:17:40Z DEBUG Reconcile event triggered by [Pod: mariadb-pipelines-definition-6d848d8d75-hg6hq] {"namespace": "my-ds-project"}
2023-08-23T17:17:40Z DEBUG DataSciencePipelinesApplication Reconciler called. {"namespace": "my-ds-project", "dspa_name": "pipelines-definition"}
2023-08-23T17:17:40Z INFO DSPA resource was not found {"namespace": "my-ds-project", "dspa_name": "pipelines-definition"}
2023-08-23T17:17:40Z DEBUG DataSciencePipelinesApplication Reconciler called. {"namespace": "my-ds-project", "dspa_name": "pipelines-definition"}
2023-08-23T17:17:40Z INFO DSPA resource was not found {"namespace": "my-ds-project", "dspa_name": "pipelines-definition"}

This PR resolves this in two commits: the first disables DSPO debug logs by default, and the second changes the "DSPA resource was not found" message from Info to Debug.
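
For context on the first commit: controller-runtime operators typically configure logging through zap options, so a minimal sketch of defaulting the log level to Info (suppressing Debug lines) might look like the following. This is an illustration of the idea, not the actual DSPO wiring.

package main

import (
	"go.uber.org/zap/zapcore"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// Default the operator logger to Info level so DEBUG lines such as
	// "Reconcile event triggered by ..." are suppressed unless a lower
	// level is explicitly requested.
	opts := zap.Options{
		Development: false,
		Level:       zapcore.InfoLevel,
	}
	ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))
}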

Testing instructions

  1. Deploy the DSPO created by GitHub
  2. Deploy DSPA
  3. (Optional) Deploy additional DSPA in another namespace
  4. Delete the original DSPA
  5. Observe the DSPO logs to ensure that "DEBUG DSPA resource was not found" does not appear

Checklist

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

@dsp-developers

A new image has been built to help with testing out this PR: quay.io/opendatahub/data-science-pipelines-operator:pr-678
An OCP cluster where you are logged in as cluster admin is required.

To use this image, run the following:

cd $(mktemp -d)
git clone git@github.com:opendatahub-io/data-science-pipelines-operator.git
cd data-science-pipelines-operator/
git fetch origin pull/678/head
git checkout -b pullrequest bd8b98f0fcf548b0adb3ae855cc78f6fe14a90e1
oc new-project opendatahub
make deploy IMG="quay.io/opendatahub/data-science-pipelines-operator:pr-678"

More instructions here on how to deploy and test a Data Science Pipelines Application.

@@ -178,6 +178,8 @@ const DefaultMaxConcurrentReconciles = 10

const DefaultRequeueTime = time.Second * 20

var DSPAdeleted = make(map[string]bool)
Contributor

I have a few questions:

  • Are we going to store all DSPA names forever?
  • How about concurrent deletion and creation using the same name? How does the code handle that?

Contributor Author

Are we going to store all DSPA names forever?

We store the name just so we can detect later whether a DSPA with the same name gets created, which basically tackles your second question.

How about concurrent deletion and creation using the same name? How does the code handle that?

Lines 185 and 186 in dspipeline_controller.go were added specifically to handle that condition.
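
As a side note on the concurrency question: the diff hunk above shows DefaultMaxConcurrentReconciles is 10, so a shared plain map would also need synchronized access. A purely illustrative guard, with hypothetical helper names:

package controllers

import "sync"

// Hypothetical synchronized wrapper around the map introduced in this diff;
// a plain map is not safe when reconciles run concurrently.
var (
	dspaDeletedMu sync.Mutex
	dspaDeleted   = make(map[string]bool)
)

func markDSPADeleted(key string) {
	dspaDeletedMu.Lock()
	defer dspaDeletedMu.Unlock()
	dspaDeleted[key] = true
}

func wasDSPADeleted(key string) bool {
	dspaDeletedMu.Lock()
	defer dspaDeletedMu.Unlock()
	return dspaDeleted[key]
}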

@diegolovison (Contributor) commented Jul 25, 2024

May we have a unit test for this change?

Contributor

I don't think this is the right approach. Storing this sort of information in memory is not scalable; even if we could store a lot of DSPA names, we would lose that information on a pod restart. Concurrency could become another issue, as Diego pointed out, if we scale the operator in the future.

The offending piece of code is probably here:

https://github.com/opendatahub-io/data-science-pipelines-operator/blob/main/controllers/dspipeline_controller.go#L531-L537

Essentially, we requeue after the CR has been deleted because of Pod changes. If we can find a way to not requeue after a CR is deleted, that would be the optimal approach. Have you considered this? If so, I'd be curious to know why we couldn't go this route.
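
For illustration, the early-exit shape being suggested could look roughly like this. The reconciler stub, Log field, and method name are assumptions for the sketch, not the actual DSPO code:

package controllers

import (
	"context"

	"github.com/go-logr/logr"
	apierrs "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	dspav1alpha1 "github.com/opendatahub-io/data-science-pipelines-operator/api/v1alpha1"
)

// Minimal stand-in for the real reconciler, just enough for the sketch.
type DSPAReconciler struct {
	client.Client
	Log logr.Logger
}

// Sketch: exit Reconcile quietly when the DSPA CR is already gone, so Pod
// events that arrive after CR deletion neither error nor requeue.
func (r *DSPAReconciler) reconcileEntry(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := r.Log.WithValues("namespace", req.Namespace, "dspa_name", req.Name)
	dspa := &dspav1alpha1.DataSciencePipelinesApplication{}
	if err := r.Get(ctx, req.NamespacedName, dspa); err != nil {
		if apierrs.IsNotFound(err) {
			log.V(1).Info("DSPA resource was not found, assuming it was recently deleted")
			return ctrl.Result{}, nil // no error, no requeue
		}
		return ctrl.Result{}, err
	}
	// ... normal reconciliation continues here ...
	return ctrl.Result{}, nil
}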

Contributor Author

@HumairAK Unfortunately, reading through the code, there is no such way: after the CR is deleted, SetupWithManager still watches the resources that belonged to it, as specified here (https://github.com/opendatahub-io/data-science-pipelines-operator/blob/main/controllers/dspipeline_controller.go#L507).

I tried the predicates mentioned in kubernetes-sigs/kubebuilder#1700 (comment) to filter out resources that are already deleted, as in the snippet below, but that doesn't help, as the Operator-SDK docs note:

This is an implementation of the controller that simply filters Delete events on Pods that have been confirmed deleted; the controller receives all Delete events that occur, and we may only care about resources that have not been completely deleted

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func ignoreDeletionPredicate() predicate.Predicate {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			// Ignore updates to CR status, in which case metadata.Generation does not change
			return e.ObjectOld.GetGeneration() != e.ObjectNew.GetGeneration()
		},
		DeleteFunc: func(e event.DeleteEvent) bool {
			// Evaluates to false if the object has been confirmed deleted.
			return !e.DeleteStateUnknown
		},
	}
}

func (r *DSPAReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&dspav1alpha1.DataSciencePipelinesApplication{}).
		Owns(&appsv1.Deployment{}).
		....
		....
		....
		WithEventFilter(ignoreDeletionPredicate()).
		Complete(r)
}

But reading through your comment: instead of keeping a separate in-memory variable for the flag, is there any harm in storing a bool flag in the DataSciencePipelinesApplication struct itself (https://github.com/opendatahub-io/data-science-pipelines-operator/blob/main/api/v1alpha1/dspipeline_types.go#L404) and leveraging that to stop reconciliation in dspipeline_controller.go?

Contributor

Well, this happens when the DSPA is deleted, so I don't think adding a bool to the DSPA is going to help.

Another thought I had was utilizing finalizers for this: have the DSPA only remove its finalizer once all Pods are cleaned up. But I don't think it makes sense to block DSPA deletion on that; it would have an unwanted impact on deletion times in the UI.

Since we already exit reconcile early, without reporting an error, when the DSPA is not found, I'm thinking we just convert the "DSPA not found" log to debug and disable debug logs by default (we should be doing this anyway).

I would like to hear others' thoughts on this approach though. @gregsheremeta or @gmfrasca any thoughts?

Contributor

Finalizers are the correct approach; a sketch of the pattern follows below.

1. kubectl delete is called on the DSPA; deletionTimestamp is set.
2. The UI treats any DSPA with a deletionTimestamp as deleted, so "Pipeline Server" stops being rendered almost immediately.
3. Resources are cleaned up.
4. Once the resources tracked by the finalizer are gone, the finalizer is removed, and the DSPA is actually deleted from the kube-apiserver.

I've never coded this myself, so I might have this a little imperfect, but this is roughly the pattern I've seen in production kube systems I've worked on.
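
A rough sketch of that pattern with controller-runtime's controllerutil helpers, reusing the reconciler shape from the earlier sketch; the finalizer name and the cleanup helper are hypothetical, and this is the generic pattern rather than code from this PR:

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	dspav1alpha1 "github.com/opendatahub-io/data-science-pipelines-operator/api/v1alpha1"
)

// hypothetical finalizer name
const dspaFinalizer = "datasciencepipelinesapplications.opendatahub.io/finalizer"

func (r *DSPAReconciler) handleFinalizer(ctx context.Context, dspa *dspav1alpha1.DataSciencePipelinesApplication) (ctrl.Result, error) {
	if dspa.ObjectMeta.DeletionTimestamp.IsZero() {
		// Not being deleted: make sure our finalizer is present so deletion
		// waits for cleanup.
		if !controllerutil.ContainsFinalizer(dspa, dspaFinalizer) {
			controllerutil.AddFinalizer(dspa, dspaFinalizer)
			return ctrl.Result{}, r.Update(ctx, dspa)
		}
		return ctrl.Result{}, nil
	}
	// deletionTimestamp is set: clean up, then release the finalizer so the
	// apiserver can actually remove the object.
	if controllerutil.ContainsFinalizer(dspa, dspaFinalizer) {
		if err := r.cleanUpOwnedResources(ctx, dspa); err != nil { // hypothetical helper
			return ctrl.Result{}, err
		}
		controllerutil.RemoveFinalizer(dspa, dspaFinalizer)
		return ctrl.Result{}, r.Update(ctx, dspa)
	}
	return ctrl.Result{}, nil
}

The "constantly getting stuck" caveat in the next comment is exactly this pattern's failure mode: if cleanup keeps erroring, the finalizer never comes off and the object never goes away.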

Contributor

this is roughly the pattern I've seen in production kube systems I've worked on

I do concede, though, that this was quite the nightmare, because things were constantly getting stuck.

Contributor

Since we already exit reconcile early, without reporting an error, when the DSPA is not found, I'm thinking we just convert the "DSPA not found" log to debug and disable debug logs by default (we should be doing this anyway).

Ok, yeah, I'm sold. I guess this drastically changes this PR -- sorry Achyut :)

So line 180 below changes to roughly:

log.Debug("DSPA resource was not found, assuming it was recently deleted, nothing to do here")

and then we disable DSPO debug logging by default (probably in a different PR 😄)

@dsp-developers

Change to PR detected. A new PR build was completed.
A new image has been built to help with testing out this PR: quay.io/opendatahub/data-science-pipelines-operator:pr-678

@amadhusu changed the title from "Stop DSPO reconcilation after DSPA deletion" to "Ensure DSPO Logging Default to Debug and modify the "DSPA resource was not found" message to Debug" on Jul 30, 2024
Achyut Madhusudan added 2 commits July 31, 2024 19:42
Signed-off-by: Achyut Madhusudan <amadhusu@redhat.com>
Signed-off-by: Achyut Madhusudan <amadhusu@redhat.com>

@amadhusu changed the title from "Ensure DSPO Logging Default to Debug and modify the "DSPA resource was not found" message to Debug" to "Disable DSPO Debug Logging by Default and modify the "DSPA resource was not found" message to Debug" on Jul 31, 2024
@amadhusu changed the title from "Disable DSPO Debug Logging by Default and modify the "DSPA resource was not found" message to Debug" to "Disable DSPO Debug Logs by Default and modify the "DSPA resource was not found" message to Debug" on Jul 31, 2024

openshift-ci bot commented Jul 31, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: diegolovison
Once this PR has been reviewed and has the lgtm label, please assign accorvin for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@diegolovison

/lgtm

@VaniHaripriya

/verified
Tested as per the instructions and I could see only the Info logs. Attaching the logs for reference (relevant entries from line 950): data-science-pipelines-operator-controller-manager-d86cdf7tfpbx-manager.log

@HumairAK

HumairAK commented Aug 7, 2024

/hold

@gregsheremeta

gonna split this up into separate tasks

@gregsheremeta

/close

@openshift-ci openshift-ci bot closed this Aug 7, 2024

openshift-ci bot commented Aug 7, 2024

@gregsheremeta: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
