Skip to content
This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Memory-based bookkeeping for Terraform operations #76

Merged
merged 14 commits into from
Oct 6, 2021

Conversation

muvaf
Copy link
Member

@muvaf muvaf commented Sep 28, 2021

Description of your changes

Changes the bookkeeping of the operations from xp.lock and .store files to a global map of operations WorkspaceStore. In addition, there are a couple of things in this PR that's not directly related to bookeeping but help the flow. Main changes are as following:

  • New WorkspaceStore to hold all Workspace objects that is created per CR instance, identified by their UUIDs.
  • Handles Terraform crashes by removing Terraform state lock file if the last operation finished (crash, error or success).
  • It spawns a new go routine for each async execution and runs the given callback once finished.
  • The callback of Apply operation is to update the annotations to make sure we capture the server-side computed IDs as soon as possible and get enqueued.
  • Introduces new optional synchronous versions of Apply and Destroy operations for resources whose creation take short time, letting us save the state as fast as it can be.
  • Assumes that if terraform destroy completes successfully, the resource is gone so that we don't have to implement custom logic to handle resources whose read calls find the resource even after it's actually deleted (KMS keys, some configuration CRs)
  • Makes the TF client such that you don't have to call ApplyAsync again to poll its results, allowing us to use the latest crossplane-runtime.
  • [TBD] Moves the adapter into the ExternalClient implementation directly. There is only a single interface now, similar to native providers.
  • Reorganizes the folder structure to be similar to native providers, i.e. terraform is client package and controller is the managed reconciler implementation package.
  • Since the tracking is in-memory start/stop of the controller does not end up in a locked state.

This PR is mostly accumulation of our collective experience testing the controllers during implementation of features during the initial push that we released as tech preview.

Special shoutout to @ulucinar who had to deal with this complex beast in the initial rush. The state handling logic he implemented is still the core logic with only some refactoring that incorporated our understanding of the whole thing.

It's ready for review but there are TODOs before merging:

  • Replace osfile calls with afero for easier testing.
  • Unit tests for controller.
  • Unit tests for terraform package.

Fixes #74
Fixes #39
Fixes #22
Fixes #21

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

I have tested it in both async and sync mode using provider-tf-aws. The example YAMLs will be there. Additionally, I tested with stopping/starting the controller and it was able to continue without any errors.

Sync:

apiVersion: vpc.aws.tf.crossplane.io/v1alpha1
kind: Vpc
metadata:
  name: sample-vpc
spec:
  forProvider:
    region: us-west-1
    cidrBlock: 10.0.0.0/16
    tagsAll:
      Name: muvaftest
    tags:
      Name: muvaftest
  providerConfigRef:
    name: example

It works just like other providers. There are a few gotchas. For example, after the deletion is completed we requeue immediately but for some reason that version of custom resource is old, hence, the finalizer removal Update fails but at that point we removed the directory. So, in the next reconcile we init again, check existence and then finalizer is removed successfully. Though this looks like a general problem that should be addressed in controller-runtime or crossplane-runtime. Other than that, I haven't caught an obvious problem.

Async:

apiVersion: rds.aws.tf.crossplane.io/v1alpha1
kind: RdsCluster
metadata:
  name: sample-cluster
spec:
  forProvider:
    region: us-west-1
    engine: aurora-postgresql
    masterUsername: cpadmin
    masterPassword: testPass23!
    skipFinalSnapshot: true
  providerConfigRef:
    name: example

This one also works as expected. One thing different regarding the async mode is that we introdue a new condition to report the last operation status. This would come in the schema of the resource for other providers but since we control the client, we return it as part of RefreshResult and put it under a new condition separately.

One thing I don't love about async mode is that even if the operation takes 5 seconds due to an error, we have to wait for a full poll interval in order to be aware of the error and I couldn't find a way to manually enqueue the resource. Same thing happens with native provider async resources as well but they usually do a client side validation with their async Create call.

I have a hunch that we might be able to get some simplification if we separate sync/async into two ExternalClient implementations but I held that off for now.

@muvaf muvaf requested review from negz, ulucinar and turkenh September 28, 2021 12:33
@muvaf muvaf force-pushed the losing-my-memory branch 2 times, most recently from 0532e1f to 7023caf Compare September 29, 2021 20:52
Copy link
Collaborator

@ulucinar ulucinar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @muvaf. As we discussed in our meeting yesterday, here are some initial comments on the PR.

pkg/tfcli/client.go Outdated Show resolved Hide resolved
}
w.LastOperation.MarkEnd()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if a Terraform command fails? We need to treat the failing operation as "ended". This looks like not to be the case now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err := cmd.Run(); err != nil {
  w.LastOperation.err = errors.Wrapf(err, "cannot apply: %s", stderr.String())
}
w.LastOperation.MarkEnd()

The error of cmd.Run() is not a terminal state here, i.e. we don't return the error. So, in all cases it reaches w.LastOperation.MarkEnd(). I moved around the files a bit, here is where this code is located now.


func NopEnqueueFn() {}

type Workspace struct {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we need synchronization on workspaces also.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added mutex lock to Operation functions since only that property of Workspace is accessed asynchronously.

pkg/client/store.go Outdated Show resolved Hide resolved
if err := cmd.Run(); err != nil {
w.LastOperation.Err = errors.Wrapf(err, "cannot apply: %s", stderr.String())
}
w.LastOperation.MarkEnd()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we had better handle crashes (i.e., non-happy paths).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a logic to WorkspaceStore.Workspace function to handle the case we talked related to crashes resulting in lock file leak of Terraform. Is there another case you had in mind?

// It only removes the resource from tfstate file. We need to call Plan to
// get an idea about whether the resource exists from Terraform's perspective.
if err := cmd.Run(); err != nil {
return RefreshResult{}, errors.Wrapf(err, "cannot refresh: %s", stderr.String())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this shortcut somewhat deviates from other Crossplane providers' usual pattern of request a deletion, and then observe the external resource (the world state) until it's reported to be missing by the Cloud API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the case where we report that resource does not exist if terraform completed the destroy operation successfully?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, this is still the case in controller level (pushed a change to integrate with controller) so it's still the same logic from managed reconciler perspective. But we do take a shortcut here at the client level saying resource does not exist if terraform destroy is successful. Theoretically, it's true that resource may still exist (like in PostgreSQLConfiguration case where the deletion is just setting back to default, or KMS keys where deletion is completed after 14 days) even though the destroy is completed. However, I think it's worth relying on terraform here to reuse their logic around that. Otherwise, we'd have to add custom logic for such resources to approve the deletion.

@muvaf
Copy link
Member Author

muvaf commented Sep 30, 2021

@ulucinar @turkenh As we talked yesterday, terrajet client now has sync and async variant of both Destroy and Apply operations for different type of resources, i.e. VPCs would be completely sync where as cluster, database etc. whose creation/deletion is long would use async functions. I'm wondering how we do the separation especially in regards to @turkenh 's work around getting input from user:

  • We can have a bool in controller like IsAsync and add if to Create and Delete functions to toggle between sync and async calls.
  • We can have two different ExternalClient implementations and user would supply which one to use as config.
  • We can have two different Adapters that act as the backend of the same ExternalClient and user would choose which backend to use, which is similar to current implementation, though most ExternalClient calls become just a call to the adapter interface (the pushed change just moved the same logic to controller directly)

I am leaning towards the first option where we can return early in async cases with a simple if so the code wouldn't look complex to read. Between 2 and 3, I'm sort of on the side of 2 where we have two distinct controllers and it'd be a single place to make changes if needed for both cases instead of having to go through another layer. What are your thoughts?

@muvaf muvaf force-pushed the losing-my-memory branch 2 times, most recently from 0813b12 to b38e800 Compare September 30, 2021 15:00
@muvaf muvaf marked this pull request as ready for review September 30, 2021 15:32
pkg/terraform/operation.go Outdated Show resolved Hide resolved
pkg/terraform/operation.go Outdated Show resolved Hide resolved
pkg/terraform/store.go Outdated Show resolved Hide resolved
pkg/terraform/store.go Outdated Show resolved Hide resolved
pkg/terraform/store.go Outdated Show resolved Hide resolved
pkg/terraform/workspace.go Outdated Show resolved Hide resolved
pkg/terraform/workspace.go Outdated Show resolved Hide resolved
pkg/controller/external.go Outdated Show resolved Hide resolved
type plan struct {
Changes struct {
Add float64 `json:"add,omitempty"`
Change float64 `json:"change,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also return a Diff here on what needs to be updated, which could finally set asExternalObservation.Diff?

This would be quite helpful for debugging infamous getting constant updates we observe in other providers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added #44 (comment) to cover both diff and error reporting, assuming the JSON parsing would do both.

pkg/controller/external.go Show resolved Hide resolved
@turkenh
Copy link
Member

turkenh commented Oct 2, 2021

@ulucinar @turkenh As we talked yesterday, terrajet client now has sync and async variant of both Destroy and Apply operations for different type of resources, i.e. VPCs would be completely sync where as cluster, database etc. whose creation/deletion is long would use async functions. I'm wondering how we do the separation especially in regards to @turkenh 's work around getting input from user:

  • We can have a bool in controller like IsAsync and add if to Create and Delete functions to toggle between sync and async calls.
  • We can have two different ExternalClient implementations and user would supply which one to use as config.
  • We can have two different Adapters that act as the backend of the same ExternalClient and user would choose which backend to use, which is similar to current implementation, though most ExternalClient calls become just a call to the adapter interface (the pushed change just moved the same logic to controller directly)

I am leaning towards the first option where we can return early in async cases with a simple if so the code wouldn't look complex to read. Between 2 and 3, I'm sort of on the side of 2 where we have two distinct controllers and it'd be a single place to make changes if needed for both cases instead of having to go through another layer. What are your thoughts?

@muvaf the first option sounds good to me. We can infer a sensible default per resource using schema.ResourceTimeout (e.g. if Create/Update Timeout > 5mins use async) and still allow provider developers to override the default behavior. We can generate setup.go based on the final value and pass this as an option to the controller.

For example, AWS RDS cluster has a default of 120 mins and we could choose async mode by default.

@muvaf
Copy link
Member Author

muvaf commented Oct 2, 2021

We can infer a sensible default per resource using schema.ResourceTimeout (e.g. if Create/Update Timeout > 5mins use async) and still allow provider developers to override the default behavior. We can generate setup.go based on the final value and pass this as an option to the controller.

@turkenh I think that's a great idea and we can use that information to default to sync mode if the timeout is 0. But I'd prefer implementing this in another PR since this one is already very big. Could you open an issue with the links you discovered?

@turkenh
Copy link
Member

turkenh commented Oct 3, 2021

We can infer a sensible default per resource using schema.ResourceTimeout (e.g. if Create/Update Timeout > 5mins use async) and still allow provider developers to override the default behavior. We can generate setup.go based on the final value and pass this as an option to the controller.

@turkenh I think that's a great idea and we can use that information to default to sync mode if the timeout is 0. But I'd prefer implementing this in another PR since this one is already very big. Could you open an issue with the links you discovered?

Sounds good, here it is: #78

Signed-off-by: Muvaffak Onus <me@muvaf.com>
@muvaf muvaf force-pushed the losing-my-memory branch from b7b25ca to d73c75f Compare October 5, 2021 12:40
muvaf added 8 commits October 5, 2021 15:42
process: remove unused process package

client: make workspace store map safe for concurrent use

client: make operation functions safe for concurrent calls

client: clean up the terraform lock file if it leaked

Signed-off-by: Muvaffak Onus <me@muvaf.com>
Signed-off-by: Muvaffak Onus <me@muvaf.com>
…ient interface and error reporting

Signed-off-by: Muvaffak Onus <me@muvaf.com>
…suggested in documentation of sync.Map

Signed-off-by: Muvaffak Onus <me@muvaf.com>
Signed-off-by: Muvaffak Onus <me@muvaf.com>
…er functions

Signed-off-by: Muvaffak Onus <me@muvaf.com>
…nd return last operation error

Signed-off-by: Muvaffak Onus <me@muvaf.com>
muvaf added 3 commits October 5, 2021 15:43
… util

Signed-off-by: Muvaffak Onus <me@muvaf.com>
Signed-off-by: Muvaffak Onus <me@muvaf.com>
@muvaf muvaf force-pushed the losing-my-memory branch from d73c75f to d53f1df Compare October 5, 2021 12:43
@muvaf
Copy link
Member Author

muvaf commented Oct 5, 2021

@turkenh @ulucinar I rebased this and integrated the latest changes with the controller, specifically late initializer. Please take another look. @turkenh I will do another pass of manual tests now to see if we can implement your suggestions about printing errors and returning diff.

@muvaf
Copy link
Member Author

muvaf commented Oct 5, 2021

The missing tests are the tests of Workspace and WorkspaceStore objects which depend on exec functionality. I'm thinking of bootstrapping a small tool to mock the exec interface, something similar to afero that we use for filesystem operations. But that's for another PR.

@muvaf muvaf force-pushed the losing-my-memory branch from d53f1df to caab0e6 Compare October 5, 2021 13:39
Signed-off-by: Muvaffak Onus <me@muvaf.com>
@muvaf muvaf force-pushed the losing-my-memory branch from caab0e6 to 09cbb70 Compare October 5, 2021 13:45
Copy link
Member

@turkenh turkenh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @muvaf, great work!

Added a couple of comments, mostly nits. I am fine with merging after you check/resolve them.

"k8s.io/apimachinery/pkg/util/json"
)

// Observable is mock Observable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkg/terraform/files.go Show resolved Hide resolved
pkg/controller/external.go Show resolved Hide resolved
pkg/controller/external.go Show resolved Hide resolved
pkg/controller/external.go Show resolved Hide resolved
pkg/controller/external.go Outdated Show resolved Hide resolved
pkg/controller/external.go Outdated Show resolved Hide resolved
ctx, cancel := context.WithDeadline(context.TODO(), w.LastOperation.StartTime().Add(defaultAsyncTimeout))
go func() {
defer cancel()
cmd := exec.CommandContext(ctx, "terraform", "apply", "-auto-approve", "-input=false", "-lock=false", "-json")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit concerned about the duplicate code in synced and async methods.
Could we refactor them into a single flow somehow to get rid of code duplication?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return statements and last operation checks make us not be able to de-duplicate unfortunately. We could make the exec.CommandContext parts generic but I don't think it's worth the indirection.

pkg/types/comments/comment.go Outdated Show resolved Hide resolved
pkg/types/comments/comment_test.go Outdated Show resolved Hide resolved
…d refresh result in ongoing state

Signed-off-by: Muvaffak Onus <me@muvaf.com>
@muvaf muvaf force-pushed the losing-my-memory branch from 869ed88 to cda3ec3 Compare October 5, 2021 23:05
Copy link
Collaborator

@ulucinar ulucinar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @muvaf, lgtm.

@muvaf muvaf merged commit 7e7b3be into crossplane:main Oct 6, 2021
@muvaf muvaf deleted the losing-my-memory branch October 6, 2021 11:26
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants