Memory-based bookkeeping for Terraform operations #76

muvaf · 2021-09-28T12:33:56Z

Description of your changes

Changes the bookkeeping of the operations from xp.lock and .store files to a global map of operations WorkspaceStore. In addition, there are a couple of things in this PR that's not directly related to bookeeping but help the flow. Main changes are as following:

New WorkspaceStore to hold all Workspace objects that is created per CR instance, identified by their UUIDs.
Handles Terraform crashes by removing Terraform state lock file if the last operation finished (crash, error or success).
It spawns a new go routine for each async execution and runs the given callback once finished.
The callback of Apply operation is to update the annotations to make sure we capture the server-side computed IDs as soon as possible and get enqueued.
Introduces new optional synchronous versions of Apply and Destroy operations for resources whose creation take short time, letting us save the state as fast as it can be.
Assumes that if terraform destroy completes successfully, the resource is gone so that we don't have to implement custom logic to handle resources whose read calls find the resource even after it's actually deleted (KMS keys, some configuration CRs)
Makes the TF client such that you don't have to call ApplyAsync again to poll its results, allowing us to use the latest crossplane-runtime.
[TBD] Moves the adapter into the ExternalClient implementation directly. There is only a single interface now, similar to native providers.
Reorganizes the folder structure to be similar to native providers, i.e. terraform is client package and controller is the managed reconciler implementation package.
Since the tracking is in-memory start/stop of the controller does not end up in a locked state.

This PR is mostly accumulation of our collective experience testing the controllers during implementation of features during the initial push that we released as tech preview.

Special shoutout to @ulucinar who had to deal with this complex beast in the initial rush. The state handling logic he implemented is still the core logic with only some refactoring that incorporated our understanding of the whole thing.

It's ready for review but there are TODOs before merging:

Replace osfile calls with afero for easier testing.
Unit tests for controller.
Unit tests for terraform package.

Fixes #74
Fixes #39
Fixes #22
Fixes #21

I have:

Read and followed Crossplane's contribution process.
Run make reviewable to ensure this PR is ready for review.
Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

I have tested it in both async and sync mode using provider-tf-aws. The example YAMLs will be there. Additionally, I tested with stopping/starting the controller and it was able to continue without any errors.

Sync:

apiVersion: vpc.aws.tf.crossplane.io/v1alpha1
kind: Vpc
metadata:
  name: sample-vpc
spec:
  forProvider:
    region: us-west-1
    cidrBlock: 10.0.0.0/16
    tagsAll:
      Name: muvaftest
    tags:
      Name: muvaftest
  providerConfigRef:
    name: example

It works just like other providers. There are a few gotchas. For example, after the deletion is completed we requeue immediately but for some reason that version of custom resource is old, hence, the finalizer removal Update fails but at that point we removed the directory. So, in the next reconcile we init again, check existence and then finalizer is removed successfully. Though this looks like a general problem that should be addressed in controller-runtime or crossplane-runtime. Other than that, I haven't caught an obvious problem.

Async:

apiVersion: rds.aws.tf.crossplane.io/v1alpha1
kind: RdsCluster
metadata:
  name: sample-cluster
spec:
  forProvider:
    region: us-west-1
    engine: aurora-postgresql
    masterUsername: cpadmin
    masterPassword: testPass23!
    skipFinalSnapshot: true
  providerConfigRef:
    name: example

This one also works as expected. One thing different regarding the async mode is that we introdue a new condition to report the last operation status. This would come in the schema of the resource for other providers but since we control the client, we return it as part of RefreshResult and put it under a new condition separately.

One thing I don't love about async mode is that even if the operation takes 5 seconds due to an error, we have to wait for a full poll interval in order to be aware of the error and I couldn't find a way to manually enqueue the resource. Same thing happens with native provider async resources as well but they usually do a client side validation with their async Create call.

I have a hunch that we might be able to get some simplification if we separate sync/async into two ExternalClient implementations but I held that off for now.

ulucinar

Thanks @muvaf. As we discussed in our meeting yesterday, here are some initial comments on the PR.

pkg/tfcli/client.go

ulucinar · 2021-09-29T07:54:27Z

pkg/tfcli/apply.go

 		}
+		w.LastOperation.MarkEnd()


What happens if a Terraform command fails? We need to treat the failing operation as "ended". This looks like not to be the case now.

if err := cmd.Run(); err != nil { w.LastOperation.err = errors.Wrapf(err, "cannot apply: %s", stderr.String()) } w.LastOperation.MarkEnd()

The error of cmd.Run() is not a terminal state here, i.e. we don't return the error. So, in all cases it reaches w.LastOperation.MarkEnd(). I moved around the files a bit, here is where this code is located now.

ulucinar · 2021-09-30T07:55:54Z

pkg/client/workspace.go

+
+func NopEnqueueFn() {}
+
+type Workspace struct {


It looks like we need synchronization on workspaces also.

Added mutex lock to Operation functions since only that property of Workspace is accessed asynchronously.

pkg/client/store.go

ulucinar · 2021-09-30T07:58:11Z

pkg/client/workspace.go

+		if err := cmd.Run(); err != nil {
+			w.LastOperation.Err = errors.Wrapf(err, "cannot apply: %s", stderr.String())
+		}
+		w.LastOperation.MarkEnd()


In general, we had better handle crashes (i.e., non-happy paths).

Added a logic to WorkspaceStore.Workspace function to handle the case we talked related to crashes resulting in lock file leak of Terraform. Is there another case you had in mind?

ulucinar · 2021-09-30T08:03:07Z

pkg/client/workspace.go

+	// It only removes the resource from tfstate file. We need to call Plan to
+	// get an idea about whether the resource exists from Terraform's perspective.
+	if err := cmd.Run(); err != nil {
+		return RefreshResult{}, errors.Wrapf(err, "cannot refresh: %s", stderr.String())


I think this shortcut somewhat deviates from other Crossplane providers' usual pattern of request a deletion, and then observe the external resource (the world state) until it's reported to be missing by the Cloud API.

Do you mean the case where we report that resource does not exist if terraform completed the destroy operation successfully?

If so, this is still the case in controller level (pushed a change to integrate with controller) so it's still the same logic from managed reconciler perspective. But we do take a shortcut here at the client level saying resource does not exist if terraform destroy is successful. Theoretically, it's true that resource may still exist (like in PostgreSQLConfiguration case where the deletion is just setting back to default, or KMS keys where deletion is completed after 14 days) even though the destroy is completed. However, I think it's worth relying on terraform here to reuse their logic around that. Otherwise, we'd have to add custom logic for such resources to approve the deletion.

muvaf · 2021-09-30T13:42:50Z

@ulucinar @turkenh As we talked yesterday, terrajet client now has sync and async variant of both Destroy and Apply operations for different type of resources, i.e. VPCs would be completely sync where as cluster, database etc. whose creation/deletion is long would use async functions. I'm wondering how we do the separation especially in regards to @turkenh 's work around getting input from user:

We can have a bool in controller like IsAsync and add if to Create and Delete functions to toggle between sync and async calls.
We can have two different ExternalClient implementations and user would supply which one to use as config.
We can have two different Adapters that act as the backend of the same ExternalClient and user would choose which backend to use, which is similar to current implementation, though most ExternalClient calls become just a call to the adapter interface (the pushed change just moved the same logic to controller directly)

I am leaning towards the first option where we can return early in async cases with a simple if so the code wouldn't look complex to read. Between 2 and 3, I'm sort of on the side of 2 where we have two distinct controllers and it'd be a single place to make changes if needed for both cases instead of having to go through another layer. What are your thoughts?

pkg/terraform/operation.go

pkg/terraform/store.go

pkg/terraform/workspace.go

pkg/controller/external.go

turkenh · 2021-10-01T22:02:31Z

pkg/terraform/workspace.go

+	type plan struct {
+		Changes struct {
+			Add    float64 `json:"add,omitempty"`
+			Change float64 `json:"change,omitempty"`


Could we also return a Diff here on what needs to be updated, which could finally set asExternalObservation.Diff?

This would be quite helpful for debugging infamous getting constant updates we observe in other providers.

Added #44 (comment) to cover both diff and error reporting, assuming the JSON parsing would do both.

pkg/controller/external.go

turkenh · 2021-10-02T07:42:00Z

@ulucinar @turkenh As we talked yesterday, terrajet client now has sync and async variant of both Destroy and Apply operations for different type of resources, i.e. VPCs would be completely sync where as cluster, database etc. whose creation/deletion is long would use async functions. I'm wondering how we do the separation especially in regards to @turkenh 's work around getting input from user:

We can have a bool in controller like IsAsync and add if to Create and Delete functions to toggle between sync and async calls.

We can have two different ExternalClient implementations and user would supply which one to use as config.

We can have two different Adapters that act as the backend of the same ExternalClient and user would choose which backend to use, which is similar to current implementation, though most ExternalClient calls become just a call to the adapter interface (the pushed change just moved the same logic to controller directly)

I am leaning towards the first option where we can return early in async cases with a simple if so the code wouldn't look complex to read. Between 2 and 3, I'm sort of on the side of 2 where we have two distinct controllers and it'd be a single place to make changes if needed for both cases instead of having to go through another layer. What are your thoughts?

@muvaf the first option sounds good to me. We can infer a sensible default per resource using schema.ResourceTimeout (e.g. if Create/Update Timeout > 5mins use async) and still allow provider developers to override the default behavior. We can generate setup.go based on the final value and pass this as an option to the controller.

For example, AWS RDS cluster has a default of 120 mins and we could choose async mode by default.

muvaf · 2021-10-02T23:16:40Z

We can infer a sensible default per resource using schema.ResourceTimeout (e.g. if Create/Update Timeout > 5mins use async) and still allow provider developers to override the default behavior. We can generate setup.go based on the final value and pass this as an option to the controller.

@turkenh I think that's a great idea and we can use that information to default to sync mode if the timeout is 0. But I'd prefer implementing this in another PR since this one is already very big. Could you open an issue with the links you discovered?

turkenh · 2021-10-03T20:01:25Z

We can infer a sensible default per resource using schema.ResourceTimeout (e.g. if Create/Update Timeout > 5mins use async) and still allow provider developers to override the default behavior. We can generate setup.go based on the final value and pass this as an option to the controller.

@turkenh I think that's a great idea and we can use that information to default to sync mode if the timeout is 0. But I'd prefer implementing this in another PR since this one is already very big. Could you open an issue with the links you discovered?

Sounds good, here it is: #78