NETOBSERV-1358: splitting controllers #503
Conversation
@jotak: This pull request references NETOBSERV-1358, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: the story was expected to target the "4.15.0" version, but no target version was set.
pkg/manager/status/status_manager.go (outdated diff excerpt)

    type Manager struct {
        sync.Mutex

        statuses map[ComponentName]ComponentStatus
Can you use sync.Map here?
Sounds good... sometimes I don't like using sync.Map, as the API is less user-friendly and it doesn't always offer better performance; but I just tried here and it plays nicely.
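For illustration, a minimal sketch of what the sync.Map variant could look like; the type names mirror the snippet above, but the fields of ComponentStatus and the helper methods are assumptions for this sketch, not the PR's actual code:

    import "sync"

    type ComponentName string

    // ComponentStatus is assumed to carry a state plus reason/message details.
    type ComponentStatus struct {
        Status  string
        Reason  string
        Message string
    }

    // Manager stores per-component statuses in a sync.Map, so no explicit
    // mutex is needed when several controllers update it concurrently.
    type Manager struct {
        statuses sync.Map // ComponentName -> ComponentStatus
    }

    func (m *Manager) setStatus(name ComponentName, s ComponentStatus) {
        m.statuses.Store(name, s)
    }

    func (m *Manager) getStatus(name ComponentName) (ComponentStatus, bool) {
        v, ok := m.statuses.Load(name)
        if !ok {
            return ComponentStatus{}, false
        }
        return v.(ComponentStatus), true
    }

The trade-off mentioned above still shows here: sync.Map values come back as any, hence the type assertion on load.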
Just a few comments.
If some comments are related to code you moved but did not change, please ignore them.
(diff excerpt)

    // Check namespace changed
    if ns != previousNamespace {
        if err := r.handleNamespaceChanged(ctx, previousNamespace, ns, desired, &flpReconciler, &cpReconciler); err != nil {
            return ctrl.Result{}, r.failure(ctx, conditions.CannotCreateNamespace(err), desired)
    ...
    if previousNamespace != "" && r.mgr.HasConsolePlugin() {
Not related to this PR.
Handling namespace changes creates some complexity that is not necessary IMO.
Apparently it is possible to make a field immutable with the kubebuilder pattern. I think it could be a nice compromise.
What do you think?
I agree that the namespace thing is a pain. We can make it immutable, yes. We then lose the ability to reconfigure it on the fly, but I guess it's quite a rare operation anyway, and the workaround would just be to manually delete then re-install the FlowCollector in the desired namespace. That would interrupt flow collection, but it's also interrupted when you modify the namespace, so... yeah, I think I'm good with that proposal :-)
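For reference, a minimal sketch of the kubebuilder pattern mentioned above, using a CEL validation marker to make a field immutable; the field, comment and message below are illustrative, not taken from the actual FlowCollector API (and CEL validation rules require a reasonably recent Kubernetes API server):

    type FlowCollectorSpec struct {
        // Namespace where the NetObserv components are deployed.
        // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="namespace is immutable"
        // +optional
        Namespace string `json:"namespace,omitempty"`
    }

With such a rule the API server rejects any update that changes the field, so the namespace would only be settable at creation time, matching the "delete and re-install" workaround discussed above.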
(diff excerpt)

    serviceAccount: cmn.Managed.NewServiceAccount(name),
    configMap: cmn.Managed.NewConfigMap(configMapName(ConfKafkaIngester)),
    roleBinding: cmn.Managed.NewCRB(RoleBindingName(ConfKafkaIngester)),
    }
The new Managed API makes this part more readable; this is nice.
yeah I don't know why I didn't do that earlier :)
LGTM!
/hold for premerge testing
/ok-to-test
New images:
They will expire after two weeks. To deploy this build:

    # Direct deployment, from operator repo
    IMAGE=quay.io/netobserv/network-observability-operator:8718484 make deploy

    # Or using operator-sdk
    operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-8718484

Or as a Catalog Source:

    apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      name: netobserv-dev
      namespace: openshift-marketplace
    spec:
      sourceType: grpc
      image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-8718484
      displayName: NetObserv development catalog
      publisher: Me
      updateStrategy:
        registryPoll:
          interval: 1m
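For instance, assuming the Catalog Source manifest above is saved locally as catalog.yaml (the filename is illustrative), it can be applied to the cluster with:

    oc apply -f catalog.yaml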
/label qe-approved
/approve
[APPROVALNOTIFIER] This PR is APPROVED.
This pull request has been approved by: jotak. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
- Start with a new "Monitoring" controller that deals with ServiceMonitors / Dashboards etc.
- Create a Status Manager that gathers all statuses from each controller
- Each status is eventually converted into a Condition (k8s status API), plus there is a global "Ready" status that is a merge of each component status

Also:
- Centralize the kubebuilder RBAC annotations in manager.go
- Update the displayed column for CLI status with the new conditions
- Keep just 1 status per component, not 2 (merged errors & progress into a single status)

- Use the status manager for FLP and its sub-components (ingest/transfo/monolith)
- Like status, namespace management needs to be per-controller, so I'm moving away from having a dedicated field in Status and using per-component annotations instead
- Simplify a bit the reconcilers' "managed objects" handling
- Less verbose narrowcache and log context, better use of named loggers
Description
edit: a later commit also extracted the FLP reconciler into a new controller
Previously
We had only 1 controller, the FlowCollector controller, which managed all deployed components (FLP, console plugin, agents, and a few other resources such as the monitoring dashboards).
The code was structured with each component managed by a "Reconciler". In terms of code structure, "reconcilers" are similar to "controllers", but at runtime it differs a lot because all "reconcilers" are managed synchronously, called from the same Reconcile loop, sharing reconcile events / cache & watches / etc.
Each reconciler was amending a global Status via hooks called SetChanged / SetInProgress. On any error, the whole reconcile loop would return, setting an error in the status conditions. The FlowCollector status was updated at the end of that single reconcile loop to reflect any new status.
Now, with this PR
We have 2 controllers, and we plan to create more in follow-up work. There's still the main / legacy FlowCollector controller, from which I extracted functions related to the monitoring stack (dashboards, roles, annotating namespace, etc.). These functions are now managed by the new controller called Monitoring.
So each controller has its own configuration related to cache / watches, its own reconcile loop and request queue. Their reconciliation code runs asynchronously from the others.
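For illustration, here is a minimal sketch of how such a dedicated controller typically registers its own watches with controller-runtime; the MonitoringReconciler type, the owned types, and the flowslatest alias for the FlowCollector API package are assumptions for this sketch, not a copy of the PR's code:

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/client"
        // flowslatest stands for the FlowCollector API package (import path omitted in this sketch)
    )

    // MonitoringReconciler is a hypothetical stand-in for the new Monitoring controller.
    type MonitoringReconciler struct {
        client.Client
    }

    // Reconcile handles only the monitoring-related resources (dashboards, roles, ...).
    func (r *MonitoringReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        // ... monitoring-specific reconciliation would go here
        return ctrl.Result{}, nil
    }

    // SetupWithManager gives this controller its own watches and reconcile queue,
    // independent from the main FlowCollector controller.
    func (r *MonitoringReconciler) SetupWithManager(mgr ctrl.Manager) error {
        return ctrl.NewControllerManagedBy(mgr).
            For(&flowslatest.FlowCollector{}).
            Owns(&corev1.ConfigMap{}). // e.g. dashboards stored as config maps
            Complete(r)
    }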
Status management still needs to be coordinated, since each controller may need to change the global status. This is achieved with a new Status Manager that holds a map keeping track of each controller's status, in a thread-safe way.
Each controller can call the status manager's functions to update its component's status, such as r.status.SetFailure("MonitoringError", err.Error()) or r.status.SetReady().
Typically, in a reconcile loop, a controller would:
- defer r.status.Commit(ctx, r.Client) to commit (synchronize) the status at the end of Reconcile
- call CheckDaemonSetProgress and CheckDeploymentProgress (they are called in the shared functions ReconcileDeployment / ReconcileDaemonSet)
- call r.status.SetFailure or r.status.Error (<= syntactic sugar) on error
When the Commit function is called, the Status Manager merges all statuses into a global one and writes it to the FlowCollector CR.
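To make this concrete, here is a hedged sketch of that pattern, reusing the Manager / ComponentStatus types from the sync.Map sketch earlier in this thread; the componentReconciler type, the reconcileComponent helper, and the merge logic are assumptions, not the PR's implementation:

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        ctrl "sigs.k8s.io/controller-runtime"
    )

    // Sketch of the per-controller usage pattern described above.
    func (r *componentReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        // Synchronize this component's status to the FlowCollector CR when done.
        defer r.status.Commit(ctx, r.Client)

        if err := r.reconcileComponent(ctx); err != nil {
            // Record the failure; Commit will surface it as a condition.
            r.status.SetFailure("ReconcileError", err.Error())
            return ctrl.Result{}, err
        }
        r.status.SetReady()
        return ctrl.Result{}, nil
    }

    // Sketch of how the manager could merge all component statuses into a single
    // global "Ready" condition before writing it to the FlowCollector CR.
    func (m *Manager) mergeReady() metav1.Condition {
        ready := metav1.Condition{Type: "Ready", Status: metav1.ConditionTrue, Reason: "Ready"}
        m.statuses.Range(func(_, v any) bool {
            s := v.(ComponentStatus)
            if s.Status != "Ready" {
                ready.Status = metav1.ConditionFalse
                ready.Reason = s.Reason
                ready.Message = s.Message
                return false // stop at the first non-ready component
            }
            return true
        })
        return ready
    }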
Dependencies
n/a
Checklist
If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.