This repository has been archived by the owner on Oct 22, 2024. It is now read-only.

more operator fixes #584

Closed

wants to merge 15 commits into from

Conversation

Contributor

@avalluri avalluri commented Apr 11, 2020

This PR contains the following fixes:

  • validating deployment certificates
  • validating deployment driver mode
  • using the operator image as the default driver image
  • do not delete the CRD on exit if there are any active deployments
  • fix /sys mounting in direct mode
  • handle reconcile on operator restart

@@ -264,6 +275,22 @@ func (d *PmemCSIDriver) initDeploymentSecrests(r *ReconcileDeployment) error {
return nil
}

func validateCertificate(encodedCert []byte, certType, commonName string) error {
Contributor:

This function duplicates some of the checks that the Go tls package will do when actually trying to establish a connection. However, how can we be sure that this function checks everything?

I don't think we can. Therefore I propose to implement this check differently: instead of checking each certificate in isolation, simulate establishing the same connections that PMEM-CSI will do later on and report an error if any of those fail.

Contributor Author:

Added a check that verifies the provided keys/certificates are valid by initiating a client connection (with the node key/certificate) to the server (with the registry key/certificate) and checking that it succeeds.
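
To illustrate that approach (this is a hedged sketch, not the actual PMEM-CSI code): perform an in-memory TLS handshake between a server configured with the registry key/certificate and a client configured with the node key/certificate, both trusting the CA. The helper name and the `pmem-registry` server name are assumptions.

```go
package validate

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net"
)

// validateKeyPairs reports an error if the node client cannot complete a TLS
// handshake against a server that uses the registry key/certificate.
func validateKeyPairs(ca *x509.CertPool, registryCert, nodeCert tls.Certificate) error {
	serverConf := &tls.Config{
		Certificates: []tls.Certificate{registryCert},
		ClientCAs:    ca,
		ClientAuth:   tls.RequireAndVerifyClientCert,
	}
	clientConf := &tls.Config{
		Certificates: []tls.Certificate{nodeCert},
		RootCAs:      ca,
		ServerName:   "pmem-registry", // assumption: must match the registry certificate's CN/SAN
	}

	// Connect client and server through an in-memory pipe and run both
	// handshakes concurrently.
	serverSide, clientSide := net.Pipe()
	defer serverSide.Close()
	defer clientSide.Close()

	serverErr := make(chan error, 1)
	go func() {
		serverErr <- tls.Server(serverSide, serverConf).Handshake()
	}()
	if err := tls.Client(clientSide, clientConf).Handshake(); err != nil {
		return fmt.Errorf("client handshake: %v", err)
	}
	if err := <-serverErr; err != nil {
		return fmt.Errorf("server handshake: %v", err)
	}
	return nil
}
```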

@@ -469,6 +486,87 @@ var _ = Describe("Operator", func() {
Expect(s.Data[corev1.TLSPrivateKeyKey]).Should(Equal(encodedKey), "mismatched private key")
})

It("shall use provided privatekeys and certificates", func() {
Contributor:

"private keys"

Contributor Author:

Done

}

// GenerateCertificateWithDuration returns a new certificate signed for given public key.
// The duration of this certificate is with in the given notBore and notAfter bounds.
Contributor:

"notBefore"

Contributor Author:

Fixed.


// GenerateCertificateWithDuration returns a new certificate signed for given public key.
// The duration of this certificate is with in the given notBore and notAfter bounds.
// Intented use this API is only by tests
Contributor:

"Intended use of"

Contributor Author:

Fixed.

@@ -72,7 +72,11 @@ $ kubectl create -f https://github.com/intel/pmem-csi/raw/operator/operator/depl
| nodeSelector | string map | [Labels to use for selecting Nodes](../README.md#run-pmem-csi-on-kubernetes) on which PMEM-CSI driver should run. | `{ "storage": "pmem" }`|
| pmemPercentage | integer | Percentage of PMEM space to be used by the driver on each node. This is only valid for a driver deployed in `lvm` mode. This field cannot be changed of a running deployment. | 100 |

<sup>1</sup> Image versions depend on the Kubernetes cluster version. The operator figures out itself the appropriate image version(s). Whereas users have to handle choosing the right version(s) themselves when overriding the values.
<sup>1</sup> To use the same container image as default driver image the operator pod must set with below environment variables with appropriate values:
- POD_NAME: Name(´metadata.name`) of the operator pod
Contributor:

This Name(metadata.name) looks like some kind of function call and does not render properly in GitHub.

I would just leave out the (metadata.name).

Contributor Author:

Fixed.

EnsureOperatorRemoved(c, o)

By("validating driver post operator delete")
validateDriverDeployment(f, o, deployment)
Contributor:

Isn't validating just once potentially racy? Deleting might have been triggered and just not done yet. It feels safer to test for a certain period.

Also, with the operator removed, what is going to remove the driver deployment after the test?

Contributor Author:

Modified the validation to check consistently for 1m. The Deployment should be deleted by the deferred cleanup at the end of the test.
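
A hedged sketch of what checking "consistently for 1m" could look like with Gomega (it assumes a variant of validateDriverDeployment that returns an error; the polling interval is illustrative):

```go
// Keep re-validating the driver deployment for a full minute so that a
// deletion that was merely triggered, but not yet finished, still gets caught.
Consistently(func() error {
	return validateDriverDeployment(f, o, deployment)
}, 1*time.Minute, 2*time.Second).ShouldNot(HaveOccurred(),
	"driver deployment should stay intact after the operator is removed")
```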

@@ -103,6 +106,17 @@ func (d *PmemCSIDriver) Reconcile(r *ReconcileDeployment) (bool, error) {
// Deployment successfull, so no more reconcile needed for this deployment
return false, nil
case api.DeploymentPhaseRunning:
if !foundInCache {
// Possible that operator restarted, so we can't assume that
// everything upto date. Look at running deployment if any changes
Contributor:

"everything is up-to-date".

Contributor Author:

Fixed.

if !foundInCache {
// Possible that operator restarted, so we can't assume that
// everything upto date. Look at running deployment if any changes
// needed
Contributor:

"Check running deployment to determine if any changes are needed."

Contributor Author:

Fixed.

@@ -103,6 +106,17 @@ func (d *PmemCSIDriver) Reconcile(r *ReconcileDeployment) (bool, error) {
// Deployment successfull, so no more reconcile needed for this deployment
return false, nil
case api.DeploymentPhaseRunning:
if !foundInCache {
// Possible that operator restarted, so we can't assume that
Contributor:

"Possibly"

Contributor Author:

Fixed.

return true, err
}

changes = d.Compare(oldDeployment)
Contributor:

I suspect there is one situation that isn't handled by just looking at the deployment spec.

Consider the case where we upgrade the operator and that the new operator will change the driver deployment somehow, for example by adding a new --foobar parameter for one of the sidecars. This difference between the running driver deployment and the one that the new operator would create is not detected, is it?

What should happen is that the operator should update the objects that it creates such that they match exactly what it would create from scratch. The simplest (?) solution would be to just do that unconditionally each time the operator starts. The apps operator can then check whether it really needs to restart pods.

This logic hinges on both operator versions creating exactly the same objects. At one point we might add some new object (for example, a service) that the old operator doesn't know about. In case of a downgrade, it would then leave that service object unchanged although it's not supposed to be part of the downgraded driver deployment.

I don't have a good solution for this because there is no generic "list all objects with certain labels" API in Kubernetes. One has to do that for all relevant object types, and "relevant" is unknown for a downgrade (the older operator would need to know what will be added in the future). However, we can implement this for an upgrade because then we know all types that may have been created in previous versions.

Contributor Author:

I suspect there is one situation that isn't handled by just looking at the deployment spec.
Consider the case where we upgrade the operator and that the new operator will change the driver deployment somehow, for example by adding a new --foobar parameter for one of the sidecars. This difference between the running driver deployment and the one that the new operator would create is not detected, is it?

Wouldn't it be better to separate a normal operator restart from a version upgrade? Say, for version upgrades we could add an annotation to the Deployment object, so that on reconcile we compare versions and decide whether the Deployment needs to be re-deployed.

Contributor:

What is the advantage over simply re-deploying after a restart? That then covers both scenarios with less code.

Contributor Author (@avalluri, Apr 14, 2020):

On re-deployment we might not detect and reject conflicting deployment changes, say, if driverMode/pmemPercentage changed while the operator was not running.

Contributor:

So for invariant attributes we must still check the running driver deployment and update the deployment object accordingly. But then we can simply update all generated objects.

Contributor Author:

Made the changes as suggested: deployment reconcile on operator restart now refreshes the objects and checks for incompatible changes that have to be rejected.

Contributor Author (@avalluri):

@pohly I believe I addressed all the points you mentioned. Can you please have a look and check whether it's OK to merge?

@avalluri avalluri requested a review from pohly April 16, 2020 10:00
pkg/pmem-csi-operator/pmem-tls/tls.go (resolved)
pkg/pmem-grpc/grpc.go (outdated, resolved)
pkg/pmem-grpc/grpc.go (outdated, resolved)
if err := deployment.EnsureDefaults(); err != nil {
driverImage, err := r.ContainerImage()
if err != nil {
klog.Warningf("Failed to find the operator image: %v", err)
Contributor:

This should be treated as a failure because clearly something is wrong.

Contributor Author:

If we want to treat this as an error, then we should move this call to the operator initialization phase instead of the reconcile loop. OK, I will make the needed changes.

Contributor Author:

Done. With the new implementation, the operator exits if it fails to get its own image.

operator/README.md (resolved)
}

framework.Logf("Deleting the operator '%s/%s' running", dep.Namespace, dep.Name)
if err = f.ClientSet.AppsV1().Deployments(dep.Namespace).Delete(dep.Name, nil); err != nil {
Contributor:

It'll be simpler to just repeat the Delete call until you get a NotFound error. As an added bonus, the code becomes more robust against temporary delete failures.

Contributor Author:

Done. Instead of deleting the deployment, the test now sets its replica count to zero and waits in a repeated loop until the operator pod gets deleted.
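
A hedged sketch of that approach with client-go; the helper name, label selector, and timeouts are illustrative, not the actual test code:

```go
package deploy

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// stopOperator scales the operator Deployment down to zero replicas and
// polls until no operator pod is left.
func stopOperator(c kubernetes.Interface, namespace, name string) error {
	dep, err := c.AppsV1().Deployments(namespace).Get(name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	zero := int32(0)
	dep.Spec.Replicas = &zero
	if _, err := c.AppsV1().Deployments(namespace).Update(dep); err != nil {
		return err
	}
	// Repeatedly check until the operator pod is gone.
	return wait.PollImmediate(2*time.Second, 3*time.Minute, func() (bool, error) {
		pods, err := c.CoreV1().Pods(namespace).List(metav1.ListOptions{
			LabelSelector: "app=pmem-csi-operator", // assumed label
		})
		if err != nil {
			return false, err
		}
		return len(pods.Items) == 0, nil
	})
}
```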

err = e
// Force update all deployment objects
if updateAll {
klog.Infof("Updating all objects for deployment %q", d.Name)
Contributor:

Please add a test for this case. A thorough test ensures that all objects are slightly different from what they should be before the operator starts, and that afterwards they pass the validateCSIDriver function.

That function itself is not as thorough as it should be (only checks some fields, but not everything); you can ignore that for now, I am working on it.

Contributor Author:

Added a test that (sketched below):

  • starts a deployment with defaults
  • removes the operator pod
  • updates all deployment fields
  • restarts the pod
  • waits until the deployment timestamp gets updated and ensures that all new values are applied to its sub-objects
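
A rough, hedged illustration of the last step (the lastUpdated status field comes from one of the commits in this PR; the getDeploymentCR re-read helper and timeouts are hypothetical):

```go
// Wait until the operator has reconciled the deployment again after its
// restart (the lastUpdated timestamp moves forward), then re-validate.
lastUpdated := deployment.Status.LastUpdated
Eventually(func() bool {
	d := getDeploymentCR(f, deployment.Name) // hypothetical helper: re-read the CR
	return d.Status.LastUpdated.After(lastUpdated.Time)
}, 3*time.Minute, 2*time.Second).Should(BeTrue(),
	"deployment should be reconciled again after operator restart")
validateDriverDeployment(f, o, deployment)
```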

if updateAll {
klog.Infof("Updating all objects for deployment %q", d.Name)
for _, obj := range d.getDeploymentObjects(r) {
if e := r.Update(obj); e != nil {
Contributor (@pohly, Apr 16, 2020):

This doesn't update the secrets. Let's merge #586 and then getDeploymentObjects truly returns all objects.

Contributor:

Another thought: what happens if the object doesn't exist because the operator we update from didn't create it?

Contributor Author:

Secrets update is a bit tricky. With the current code, for a deployment with no keys/certificates set, it creates a new set of secrets for every call.

Strictly speaking, this function only deals with the objects we created in the Initializing phase, so it will not (and need not) update the secrets.

Contributor:

I don't follow. We need to update all objects only if we found an existing deployment in an unknown state. Why would that then get triggered for every call of the function?

Contributor Author:

We end up in this function only when the deployment is in the "Running" phase, so for that deployment the secrets must already have been created and need no update. Moreover, secrets creation is not idempotent in the case of a deployment using operator-generated keys/certificates.

Contributor:

We end up in this function, only when the deployment is in "Running" phase.

Which is a problem. We need to update stale objects regardless of the phase.

Let's do this:

  • review and merge operator: simplify object creation #586 because it simplifies where we create objects
  • remove the "operator: reload state of a running deployment if not found in cache" from this PR because the rest should be okay
  • write E2E test cases for reconciling on startup, then check whether the new code passes those tests

// buildFromPreDeployed builds a new Deployment object by fetching details
// from existing deplyment sub-resources.
func (d *PmemCSIDriver) buildFromPreDeployed(r *ReconcileDeployment) (*api.Deployment, error) {
Contributor:

Same here: if an object cannot be found, we shouldn't treat that as an error because otherwise the driver deployment gets stuck forever.

Instead, the operator must recover by using the desired deployment parameters for those objects which don't exist, then later create them.

Needs a test...

Contributor Author:

Another thought: what happens if the object doesn't exist because the operator we update
from didn't create it?

To handle such cases we could change the semantics of Update()/Create() so that it creates the object if it does not exist and otherwise updates it.
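
A hedged sketch of those semantics with a controller-runtime client; the helper name and error handling are illustrative, not the actual Reconciler code:

```go
import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createOrUpdate creates the object and falls back to an update when it
// already exists. Note that for the Update() call to be accepted, the
// object's resourceVersion has to be copied from the live object first
// (see the later discussion about metadata in this PR).
func createOrUpdate(c client.Client, obj runtime.Object) error {
	err := c.Create(context.TODO(), obj)
	if err == nil || !apierrors.IsAlreadyExists(err) {
		return err
	}
	return c.Update(context.TODO(), obj)
}
```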

Contributor Author:

Same here: if an object cannot be found, we shouldn't treat that as an error because otherwise the driver deployment gets stuck forever.
Instead, the operator must recover by using the desired deployment parameters for those objects which don't exist, then later create them.

I believe this is how the current implementation of this function behaves: it doesn't return an error for an object that was not found. Can you please point out which part of the code does not follow this?

@avalluri avalluri force-pushed the validate-certs branch 3 times, most recently from 336b0a4 to a8b9901 on April 16, 2020 21:32
// that was created by the same deployment
klog.Infof("Updating: %q of type %q ", metaObj.GetName(), obj.GetObjectKind().GroupVersionKind())
// Update existing active object,
return r.client.Update(context.TODO(), obj)
Contributor:

I doubt that the apiserver will accept the update because the resourceVersion won't match.

Write an E2E test for this and we'll know...

Contributor Author:

Yes, true. I fixed this by copying the relevant metadata from the object returned by the API server to our new object before the update.
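
A hedged sketch of what copying that metadata could look like. The ReconcileDeployment receiver and r.client appear in the quoted code above; the helper name and the rest are illustrative:

```go
import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateObject fetches the live object and carries over the metadata that the
// apiserver requires (resourceVersion, UID) before issuing the update.
func (r *ReconcileDeployment) updateObject(obj runtime.Object) error {
	metaObj, err := meta.Accessor(obj)
	if err != nil {
		return err
	}
	existing := obj.DeepCopyObject()
	key := client.ObjectKey{Namespace: metaObj.GetNamespace(), Name: metaObj.GetName()}
	if err := r.client.Get(context.TODO(), key, existing); err != nil {
		return err
	}
	existingMeta, err := meta.Accessor(existing)
	if err != nil {
		return err
	}
	// Without the current resourceVersion the apiserver rejects the update.
	metaObj.SetResourceVersion(existingMeta.GetResourceVersion())
	metaObj.SetUID(existingMeta.GetUID())
	return r.client.Update(context.TODO(), obj)
}
```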

Added a check that validates that the provided deployment keys and certificates
are usable by the driver. This is done by starting a server with the given
registry certificates and a client that connects to that server with the
node-controller certificates. If the connection succeeds, we treat the
certificates as valid.
If the provided device mode is not in our supported list, the deployment should
fail with an appropriate error. In practice this check is not required, as it
could be handled by JSON schema validation, but it is still good to have in
order to avoid any back doors.
By retrieving the operator pod and its container image we can ensure
that the exact same image is used for deploying the driver. For this to
work the operator must be provided with these environment variables:
 - POD_NAME: name of the operator pod (`metadata.name`)
 - OPERATOR_NAME: name of the operator container, defaults to
`pmem-csi-operator`

FIXES: intel#576, intel#578
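
A hedged sketch of how such a lookup could be implemented with client-go; the actual code lives behind r.ContainerImage(), and the function name, namespace handling, and error messages here are illustrative:

```go
import (
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// containerImage resolves the operator's own image from its pod spec using
// the POD_NAME and OPERATOR_NAME environment variables described above.
func containerImage(c kubernetes.Interface, namespace string) (string, error) {
	podName := os.Getenv("POD_NAME")
	if podName == "" {
		return "", fmt.Errorf("POD_NAME environment variable not set")
	}
	containerName := os.Getenv("OPERATOR_NAME")
	if containerName == "" {
		containerName = "pmem-csi-operator"
	}
	pod, err := c.CoreV1().Pods(namespace).Get(podName, metav1.GetOptions{})
	if err != nil {
		return "", fmt.Errorf("failed to get operator pod %q: %v", podName, err)
	}
	for _, cont := range pod.Spec.Containers {
		if cont.Name == containerName {
			return cont.Image, nil
		}
	}
	return "", fmt.Errorf("no container named %q in pod %q", containerName, podName)
}
```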
Deleting the CRD deletes all active resources, which means that restarting the
operator pod ends up deleting the driver. So we only delete the CRD if no
active deployments are found.

FIXES: intel#579
Added a new status field 'lastUpdated' that holds the timestamp of when the
deployment was last updated, in other words, the last time the deployment got
reconciled.
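
A minimal sketch of the corresponding API type change; the Phase field is taken from the existing code, the JSON tags are assumptions:

```go
// DeploymentStatus defines the observed state of a Deployment.
type DeploymentStatus struct {
	// Phase indicates the current state of the deployment.
	Phase DeploymentPhase `json:"phase,omitempty"`
	// LastUpdated is the time the deployment was last updated, i.e. last reconciled.
	LastUpdated metav1.Time `json:"lastUpdated,omitempty"`
}
```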
In case of an operator restart we refresh all the objects of a reconciling
deployment, to handle the case of operator upgrades that result in changes to
the running driver resources.

At the same time we should be able to handle conflicting changes made to the
deployment while the operator was restarting. To find those changes we do not
have a cached object representing the existing deployment, so we reconstruct
the deployment by looking at its running sub-resources and use that for
finding the diff.
Deployment reconcile should be able to detect and recover any missing
objects. We achieve this by changing the semantics of Reconciler.Create()
such that it updates the object if it already exists and otherwise creates
a new one.
Contributor Author (@avalluri):

I suspect the test failure is related to #593. @pohly, can you please check whether it's a real issue or something I messed up in my tests?

This adds a missing certificates check to the reconcile loop, so that the
deployment gets updated with new secrets if such a change is found.

This also fixes a couple of issues discovered in earlier commits.
For retrieving cluster-scoped objects we should not use a namespace.
Not required, as we are not using a per-test namespace. All tests run in the
default namespace.
Operator tests are expected to run with a new operator per test, so that
we can ensure a test is not affected by leftover data from a previous test.

But the test deployment reuses the previously running operator. This change
ensures that the operator gets deleted at the end of every test.

Also made changes to reuse the deploy.Cluster object and adjusted the time
intervals for the tests.
Auto-reconciling a deployment that failed due to a driver name clash is not
required, because unless the user resolves that clash by updating the
deployment it cannot succeed. So reconcile only when that deployment gets
updated.
@pohly (Contributor) commented Apr 19, 2020

operator: do not requeue reconcile on driver name clash

That commit argues that reconciliation can be stopped until the user updates the deployment spec. But what if the user resolves the conflict by removing the other driver instance? Then the deployment becomes usable without updating it.

@pohly (Contributor) commented Apr 19, 2020

operator: fix test failures

Which failure is that fixing? Did some other commit change the behavior? Then this commit should be squashed into that other one.

Contributor (@pohly) left a comment:

This PR has become too large and changes many things at once. Can you break it up into smaller PRs? If there are code conflicts, then let's merge one change after the other.

@pohly (Contributor) commented Apr 19, 2020

I suspect the test failure is related to #593.

Did you change something? The latest test run was successful.

validateDriverDeployment(f, d, deployment)

// Stop the operator
deleteOperator(c, d)
Contributor:

"stopOperator" is a better name for the function. Then you don't need the comment saying that "delete = stop"...


By("Restarting the operator deployment...")
// Start the operator
createOperator(c, d)
Contributor:

createOperator -> startOperator

Contributor Author (@avalluri):

All of the commits except one (146cde6) have been merged through other PRs, so I am closing this PR.
