Conversation
@@ -264,6 +275,22 @@ func (d *PmemCSIDriver) initDeploymentSecrests(r *ReconcileDeployment) error {
	return nil
}

func validateCertificate(encodedCert []byte, certType, commonName string) error {
This function duplicates some of the checks that the Go tls package will do when actually trying to establish a connection. However, how can we be sure that this function checks everything?
I don't think we can. Therefore I propose to implement this check differently: instead of checking each certificate in isolation, simulate establishing the same connections that PMEM-CSI will do later on and report an error if any of those fail.
Added a check that verifies the provided keys/certificates are valid by starting a server with the registry key/certificate, initiating a client connection to it with the node key/certificate, and confirming that the connection succeeds.
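For illustration, a minimal sketch of that approach in Go, not the PR's exact code; `serverName` must match the common name of the registry certificate and all inputs are assumed to be PEM-encoded:

```go
package validate

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net"
)

// validateTLS simulates the connection PMEM-CSI makes later: an in-memory TLS
// "server" with the registry key/cert, a "client" with the node key/cert,
// both verifying peers against the CA. Any handshake failure is reported.
func validateTLS(caCert, regCert, regKey, nodeCert, nodeKey []byte, serverName string) error {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caCert) {
		return errors.New("failed to parse CA certificate")
	}
	serverPair, err := tls.X509KeyPair(regCert, regKey)
	if err != nil {
		return fmt.Errorf("registry key pair: %v", err)
	}
	clientPair, err := tls.X509KeyPair(nodeCert, nodeKey)
	if err != nil {
		return fmt.Errorf("node key pair: %v", err)
	}
	serverConf := &tls.Config{
		Certificates: []tls.Certificate{serverPair},
		ClientCAs:    pool,
		ClientAuth:   tls.RequireAndVerifyClientCert,
	}
	clientConf := &tls.Config{
		Certificates: []tls.Certificate{clientPair},
		RootCAs:      pool,
		ServerName:   serverName,
	}
	// net.Pipe avoids opening real sockets; both handshakes must run
	// concurrently because the pipe is synchronous.
	c1, c2 := net.Pipe()
	defer c1.Close()
	defer c2.Close()
	serverErr := make(chan error, 1)
	go func() { serverErr <- tls.Server(c1, serverConf).Handshake() }()
	if err := tls.Client(c2, clientConf).Handshake(); err != nil {
		return fmt.Errorf("client handshake: %v", err)
	}
	return <-serverErr
}
```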
@@ -469,6 +486,87 @@ var _ = Describe("Operator", func() {
	Expect(s.Data[corev1.TLSPrivateKeyKey]).Should(Equal(encodedKey), "mismatched private key")
})

It("shall use provided privatekeys and certificates", func() {
"private keys"
Done
}

// GenerateCertificateWithDuration returns a new certificate signed for given public key.
// The duration of this certificate is with in the given notBore and notAfter bounds.
"notBefore"
Fixed.
// GenerateCertificateWithDuration returns a new certificate signed for given public key.
// The duration of this certificate is with in the given notBore and notAfter bounds.
// Intented use this API is only by tests
"Intended use of"
Fixed.
operator/README.md
@@ -72,7 +72,11 @@ $ kubectl create -f https://github.com/intel/pmem-csi/raw/operator/operator/depl
| nodeSelector | string map | [Labels to use for selecting Nodes](../README.md#run-pmem-csi-on-kubernetes) on which PMEM-CSI driver should run. | `{ "storage": "pmem" }`|
| pmemPercentage | integer | Percentage of PMEM space to be used by the driver on each node. This is only valid for a driver deployed in `lvm` mode. This field cannot be changed of a running deployment. | 100 |

<sup>1</sup> Image versions depend on the Kubernetes cluster version. The operator figures out itself the appropriate image version(s). Whereas users have to handle choosing the right version(s) themselves when overriding the values.
<sup>1</sup> To use the same container image as default driver image the operator pod must set with below environment variables with appropriate values:
- POD_NAME: Name(´metadata.name`) of the operator pod |
This `Name(metadata.name)` looks like some kind of function call and does not render properly in GitHub. I would just leave out the `(metadata.name)`.
Fixed.
test/e2e/operator/deployment_api.go
EnsureOperatorRemoved(c, o)

By("validating driver post operator delete")
validateDriverDeployment(f, o, deployment)
Isn't validating just once potentially racy? Deleting might have been triggered and just not done yet. It feels safer to test for a certain period.
Also, with the operator removed, what is going to remove the driver deployment after the test?
Modified the validation to check consistently for 1m. The Deployment should be deleted by the `defer` at the end of the test.
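For reference, a check over a period could look roughly like this with Gomega; `validateDeploymentErr` is a hypothetical error-returning variant of the test's validation helper, not the PR's actual code:

```go
// Inside a Ginkgo It() block, with Gomega dot-imported.
By("validating driver for 1 minute after operator delete")
Consistently(func() error {
	// validateDeploymentErr is a hypothetical error-returning variant
	// of validateDriverDeployment.
	return validateDeploymentErr(f, o, deployment)
}, "1m", "2s").ShouldNot(HaveOccurred())
```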
@@ -103,6 +106,17 @@ func (d *PmemCSIDriver) Reconcile(r *ReconcileDeployment) (bool, error) {
	// Deployment successfull, so no more reconcile needed for this deployment
	return false, nil
case api.DeploymentPhaseRunning:
	if !foundInCache {
		// Possible that operator restarted, so we can't assume that
		// everything upto date. Look at running deployment if any changes
"everything is up-to-date".
Fixed.
if !foundInCache {
	// Possible that operator restarted, so we can't assume that
	// everything upto date. Look at running deployment if any changes
	// needed
"Check running deployment to determine if any changes are needed."
Fixed.
@@ -103,6 +106,17 @@ func (d *PmemCSIDriver) Reconcile(r *ReconcileDeployment) (bool, error) {
	// Deployment successfull, so no more reconcile needed for this deployment
	return false, nil
case api.DeploymentPhaseRunning:
	if !foundInCache {
		// Possible that operator restarted, so we can't assume that
"Possibly"
Fixed.
	return true, err
}

changes = d.Compare(oldDeployment)
I suspect there is one situation that isn't handled by just looking at the deployment spec.
Consider the case where we upgrade the operator and the new operator changes the driver deployment somehow, for example by adding a new `--foobar` parameter for one of the sidecars. This difference between the running driver deployment and the one that the new operator would create is not detected, is it?
What should happen is that the operator should update the objects that it creates such that they match exactly what it would create from scratch. The simplest (?) solution would be to just do that unconditionally each time the operator starts. The apps operator can then check whether it really needs to restart pods.
This logic hinges on both operator versions creating exactly the same objects. At one point we might add some new object (for example, a service) that the old operator doesn't know about. In case of a downgrade, it would then leave that service object unchanged although it's not supposed to be part of the downgraded driver deployment.
I don't have a good solution for this because there is no generic "list all objects with certain labels" API in Kubernetes. One has to do that for all relevant object types, and "relevant" is unknown for a downgrade (the older operator would need to know what will be added in the future). However, we can implement this for an upgrade because then we know all types that may have been created in previous versions.
> I suspect there is one situation that isn't handled by just looking at the deployment spec.
> Consider the case where we upgrade the operator and the new operator changes the driver deployment somehow, for example by adding a new `--foobar` parameter for one of the sidecars. This difference between the running driver deployment and the one that the new operator would create is not detected, is it?

Wouldn't it be better to separate a normal operator restart from a version upgrade? Say, for version upgrades we could add an annotation to the Deployment object, so on reconcile we compare versions and decide if the Deployment needs to be re-deployed.
What is the advantage over simply re-deploying after a restart? That then covers both scenarios with less code.
On re-deployment we might fail to detect and reject conflicting deployment changes. Say, driverMode/pmemPercentage changed while the operator was not running.
So for invariant attributes we must still check the running driver deployment and update the deployment object accordingly. But then we can simply update all generated objects.
Made the changes as suggested: deployment reconcile on operator restart now refreshes the objects and rejects any incompatible changes.
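A simplified sketch of that flow, reusing the PR's `buildFromPreDeployed`, `Compare`, and `getDeploymentObjects`; `containsImmutableChange` and `r.UpdateOrCreate` are placeholder names, not the PR's exact API:

```go
// Hypothetical outline of the reconcile-after-restart handling.
func (d *PmemCSIDriver) reconcileAfterRestart(r *ReconcileDeployment) error {
	// Reconstruct the last-known deployment from its running sub-resources,
	// because the operator's cache is empty after a restart.
	oldDeployment, err := d.buildFromPreDeployed(r)
	if err != nil {
		return err
	}
	// Reject changes to immutable fields (e.g. deviceMode, pmemPercentage)
	// that were made while the operator was not running.
	changes := d.Compare(oldDeployment)
	if containsImmutableChange(changes) {
		return fmt.Errorf("rejecting incompatible changes: %v", changes)
	}
	// Refresh all generated objects so that an upgraded operator brings the
	// running driver in line with what it would create from scratch.
	for _, obj := range d.getDeploymentObjects(r) {
		if err := r.UpdateOrCreate(obj); err != nil {
			return err
		}
	}
	return nil
}
```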
@pohly I believe I addressed all the points you mentioned. Can you please have a look and check whether it's OK to merge?
pkg/pmem-csi-operator/controller/deployment/deployment_controller_test.go
if err := deployment.EnsureDefaults(); err != nil {
driverImage, err := r.ContainerImage()
if err != nil {
	klog.Warningf("Failed to find the operator image: %v", err)
This should be treated as a failure because clearly something is wrong.
If we want to treat this as an error, then we should move this call to the operator initialization phase instead of the reconcile loop. OK, I will make the needed changes.
Done. As per the new implementation, the operator exits if it fails to get its own image.
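A sketch of how the operator could determine its own image, assuming a recent client-go and the POD_NAME/OPERATOR_NAME environment variables described in the README change above; the function name and default container name are illustrative, not the PR's exact `ContainerImage()` implementation:

```go
// operatorImage looks up the operator's own pod and returns the image of
// its operator container.
func operatorImage(ctx context.Context, c kubernetes.Interface, namespace string) (string, error) {
	podName := os.Getenv("POD_NAME")
	if podName == "" {
		return "", errors.New("POD_NAME environment variable not set")
	}
	containerName := os.Getenv("OPERATOR_NAME")
	if containerName == "" {
		containerName = "pmem-csi-operator"
	}
	pod, err := c.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	for _, container := range pod.Spec.Containers {
		if container.Name == containerName {
			return container.Image, nil
		}
	}
	return "", fmt.Errorf("container %q not found in pod %q", containerName, podName)
}
```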
test/e2e/operator/deployment_api.go
}

framework.Logf("Deleting the operator '%s/%s' running", dep.Namespace, dep.Name)
if err = f.ClientSet.AppsV1().Deployments(dep.Namespace).Delete(dep.Name, nil); err != nil {
It'll be simpler to just repeat the `Delete` call until you get a `NotFound` error. As an added bonus, the code becomes more robust against temporary delete failures.
Done. Instead of deleting the deployment, I scale its replica count to zero and wait in a loop until the operator pod gets deleted.
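A rough sketch of that pattern, assuming a recent client-go; the label selector and intervals are assumptions, not the PR's values:

```go
// stopOperator scales the operator Deployment to zero replicas and waits
// until its pod is gone. Illustrative only.
func stopOperator(ctx context.Context, c kubernetes.Interface, namespace, name string) error {
	zero := int32(0)
	dep, err := c.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	dep.Spec.Replicas = &zero
	if _, err := c.AppsV1().Deployments(namespace).Update(ctx, dep, metav1.UpdateOptions{}); err != nil {
		return err
	}
	// Poll until no pods with the operator's label remain.
	return wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		pods, err := c.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
			LabelSelector: "app=pmem-csi-operator", // assumed label
		})
		if err != nil {
			return false, err
		}
		return len(pods.Items) == 0, nil
	})
}
```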
pkg/pmem-csi-operator/controller/deployment/controller_driver.go
	err = e
// Force update all deployment objects
if updateAll {
	klog.Infof("Updating all objects for deployment %q", d.Name)
Please add a test for this case. A thorough test ensures that all objects are slightly different than they should be before the operator starts and afterwards pass the `validateCSIDriver` function.
That function itself is not as thorough as it should be (only checks some fields, but not everything); you can ignore that for now, I am working on it.
Added a test that:
- starts a deployment with defaults
- removes the operator pod
- updates all deployment fields
- restarts the pod
- waits until the deployment timestamp gets updated and ensures that all new values are applied to its sub-objects (see the outline below).
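An outline of that test; all helpers (`createDeploymentWithDefaults`, `stopOperator`, `updateAllFields`, `startOperator`, `getDeployment`) are placeholder names, not necessarily the ones used in the PR:

```go
It("shall reconcile changes made while the operator was down", func() {
	deployment := createDeploymentWithDefaults(f) // start with defaults
	validateDriverDeployment(f, d, deployment)

	stopOperator(c, d) // remove the operator pod

	deployment = updateAllFields(f, deployment) // mutate every mutable field

	startOperator(c, d) // restart the operator

	// Wait until status.lastUpdated changes, then re-check the sub-objects.
	lastUpdated := deployment.Status.LastUpdated
	Eventually(func() bool {
		current := getDeployment(f, deployment.Name)
		return current.Status.LastUpdated.After(lastUpdated.Time)
	}, "3m", "5s").Should(BeTrue())
	validateDriverDeployment(f, d, deployment)
})
```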
if updateAll {
	klog.Infof("Updating all objects for deployment %q", d.Name)
	for _, obj := range d.getDeploymentObjects(r) {
		if e := r.Update(obj); e != nil {
This doesn't update the secrets. Let's merge #586 and then `getDeploymentObjects` truly returns all objects.
Another thought: what happens if the object doesn't exist because the operator we update from didn't create it?
Updating secrets is a bit tricky: with the current code, for a deployment with no keys/certificates set, it creates a new set of secrets on every call.
Strictly speaking, this function only deals with the objects we created in the Initializing phase, so it will not (and need not) update the secrets.
I don't follow. We need to update all objects only if we found an existing deployment in an unknown state. Why would that then get triggered for every call of the function?
We end up in this function only when the deployment is in the "Running" phase, so its secrets must already have been created and need no update. Moreover, secret creation is not idempotent for a deployment using operator-generated keys/certificates.
> We end up in this function, only when the deployment is in "Running" phase.

Which is a problem. We need to update stale objects regardless of the phase.
Let's do this:
- review and merge "operator: simplify object creation" (#586) because it simplifies where we create objects
- remove the "operator: reload state of a running deployment if not found in cache" commit from this PR because the rest should be okay
- write E2E test cases for reconciling on startup, then check whether the new code passes those tests
// buildFromPreDeployed builds a new Deployment object by fetching details
// from existing deplyment sub-resources.
func (d *PmemCSIDriver) buildFromPreDeployed(r *ReconcileDeployment) (*api.Deployment, error) {
Same here: if an object cannot be found, we shouldn't treat that as an error because otherwise the driver deployment gets stuck forever.
Instead, the operator must recover by using the desired deployment parameters for those objects which don't exist, then later create them.
Needs a test...
> Another thought: what happens if the object doesn't exist because the operator we update from didn't create it?

To handle such cases we could change the semantics of Update()/Create() such that it creates the object if it does not exist, else updates it.
> Same here: if an object cannot be found, we shouldn't treat that as an error because otherwise the driver deployment gets stuck forever.
> Instead, the operator must recover by using the desired deployment parameters for those objects which don't exist, then later create them.

I believe this is how the current implementation of this function behaves: it doesn't return an error for an object that was not found. Can you please point out which part of the code does not follow this?
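A possible shape for those create-or-update semantics, assuming a recent controller-runtime; this is a sketch, not the PR's Reconciler code:

```go
// createOrUpdate creates obj, and if it already exists, updates it after
// copying the live object's resourceVersion so the apiserver accepts the write.
func createOrUpdate(ctx context.Context, c client.Client, obj client.Object) error {
	err := c.Create(ctx, obj)
	if err == nil || !apierrors.IsAlreadyExists(err) {
		return err
	}
	// Object exists: fetch it to obtain the current resourceVersion.
	existing := obj.DeepCopyObject().(client.Object)
	if err := c.Get(ctx, client.ObjectKeyFromObject(obj), existing); err != nil {
		return err
	}
	obj.SetResourceVersion(existing.GetResourceVersion())
	return c.Update(ctx, obj)
}
```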
// that was created by the same deployment
klog.Infof("Updating: %q of type %q ", metaObj.GetName(), obj.GetObjectKind().GroupVersionKind())
// Update existing active object,
return r.client.Update(context.TODO(), obj)
I doubt that the apiserver will accept the update because the resourceVersion won't match.
Write an E2E test for this and we'll know...
Yes, true. I fixed this by copying the relevant metadata from the object returned by the API server to our new object before the update.
It could be used by the operator.
Added a check that validates that the provided deployment keys and certificates are usable by the driver. This is done by starting a server with the given registry certificates and connecting a client to that server with the node-controller certificates. If the connection succeeds, we treat the certificates as valid.
If the provided device mode is not in our supported list, the deployment should fail with the appropriate error. In practice this check is not required as it could be handled by JSON schema validation, but it is still good to have to avoid any back doors.
By retrieving the operator pod and its container image we can ensure that the exact same image is used for deploying the driver. For this to work the operator must be provided with these environment variables:
- POD_NAME: name of the operator pod (metadata.name)
- OPERATOR_NAME: name of the operator container, defaults to `pmem-csi-operator`
FIXES: intel#576, intel#578
Deleting the CRD deletes all active resources, which means that restarting the operator pod deletes the driver. So we only delete the CRD if no active deployments are found. FIXES: intel#579
Added a new status field 'lastUpdated' that holds the timestamp of the deployment's last update, in other words the last time the deployment got reconciled.
On operator restart we refresh all the objects of a reconciling deployment, to handle operator upgrades that result in changes to running driver resources. At the same time we must be able to handle conflicting changes made to a deployment while the operator was restarting. To find those changes we do not have a cached object representing the existing deployment, so we reconstruct the deployment by looking at its running sub-resources and use that for finding the diff.
Deployment reconcile should be able to detect and recover any missing objects. We achieve this by changing the semantics of Reconciler.Create() such that it updates the object if it already exists, else it creates a new one.
This adds the missing certificate check in the reconcile loop, so that it updates the secrets if the certificates change. This also fixes a couple of issues discovered in earlier commits.
For retrieving cluster-scoped objects we should not set a namespace.
Not required as we are not using a per-test namespace. All tests run in the default namespace.
Operator tests are expected to run with a new operator per test, so that we can ensure a test is not affected by leftover data from a previous test. But the test deployment reuses the previously running operator. This change ensures that the operator gets deleted at the end of every test. Also made changes to reuse the deploy.Cluster object and adjusted the time intervals for tests.
Auto-reconciling a deployment that failed due to a driver name clash is not required, because unless the user resolves that clash by updating the deployment it cannot succeed. So reconcile only when that deployment gets updated.
That commit argues that reconciliation can be stopped until the user updates the deployment spec. But what if the user resolves the conflict by removing the other driver instance? Then the deployment becomes usable without updating it.
Which failure is that fixing? Did some other commit change the behavior? Then this commit should be squashed into that other one.
This PR has become too large and changes many things at once. Can you break it up into smaller PRs? If there are code conflicts, then let's merge one change after the other.
Did you change something? The latest test run was successful.
validateDriverDeployment(f, d, deployment)

// Stop the operator
deleteOperator(c, d)
"stopOperator" is a better name for the function. Then you don't need the comment saying that "delete = stop"...
By("Restarting the operator deployment...")
// Start the operator
createOperator(c, d)
`createOperator` -> `startOperator`
All of the commits except 146cde6 are merged through other PRs, so I would close this PR.
This PR contains the fixes below: