-
Notifications
You must be signed in to change notification settings - Fork 38
Memory-based bookkeeping for Terraform operations #76
Conversation
0532e1f
to
7023caf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @muvaf. As we discussed in our meeting yesterday, here are some initial comments on the PR.
pkg/tfcli/apply.go
Outdated
} | ||
w.LastOperation.MarkEnd() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if a Terraform command fails? We need to treat the failing operation as "ended". This looks like not to be the case now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if err := cmd.Run(); err != nil {
w.LastOperation.err = errors.Wrapf(err, "cannot apply: %s", stderr.String())
}
w.LastOperation.MarkEnd()
The error of cmd.Run()
is not a terminal state here, i.e. we don't return
the error. So, in all cases it reaches w.LastOperation.MarkEnd()
. I moved around the files a bit, here is where this code is located now.
pkg/client/workspace.go
Outdated
|
||
func NopEnqueueFn() {} | ||
|
||
type Workspace struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we need synchronization on workspaces also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added mutex lock to Operation
functions since only that property of Workspace
is accessed asynchronously.
pkg/client/workspace.go
Outdated
if err := cmd.Run(); err != nil { | ||
w.LastOperation.Err = errors.Wrapf(err, "cannot apply: %s", stderr.String()) | ||
} | ||
w.LastOperation.MarkEnd() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, we had better handle crashes (i.e., non-happy paths).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a logic to WorkspaceStore.Workspace
function to handle the case we talked related to crashes resulting in lock file leak of Terraform. Is there another case you had in mind?
pkg/client/workspace.go
Outdated
// It only removes the resource from tfstate file. We need to call Plan to | ||
// get an idea about whether the resource exists from Terraform's perspective. | ||
if err := cmd.Run(); err != nil { | ||
return RefreshResult{}, errors.Wrapf(err, "cannot refresh: %s", stderr.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this shortcut somewhat deviates from other Crossplane providers' usual pattern of request a deletion, and then observe the external resource (the world state) until it's reported to be missing by the Cloud API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the case where we report that resource does not exist if terraform
completed the destroy operation successfully?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, this is still the case in controller level (pushed a change to integrate with controller) so it's still the same logic from managed reconciler perspective. But we do take a shortcut here at the client level saying resource does not exist if terraform destroy is successful. Theoretically, it's true that resource may still exist (like in PostgreSQLConfiguration case where the deletion is just setting back to default, or KMS keys where deletion is completed after 14 days) even though the destroy is completed. However, I think it's worth relying on terraform here to reuse their logic around that. Otherwise, we'd have to add custom logic for such resources to approve the deletion.
@ulucinar @turkenh As we talked yesterday, terrajet client now has sync and async variant of both
I am leaning towards the first option where we can return early in async cases with a simple if so the code wouldn't look complex to read. Between 2 and 3, I'm sort of on the side of 2 where we have two distinct controllers and it'd be a single place to make changes if needed for both cases instead of having to go through another layer. What are your thoughts? |
0813b12
to
b38e800
Compare
type plan struct { | ||
Changes struct { | ||
Add float64 `json:"add,omitempty"` | ||
Change float64 `json:"change,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also return a Diff here on what needs to be updated, which could finally set asExternalObservation.Diff?
This would be quite helpful for debugging infamous getting constant updates we observe in other providers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added #44 (comment) to cover both diff and error reporting, assuming the JSON parsing would do both.
@muvaf the first option sounds good to me. We can infer a sensible default per resource using schema.ResourceTimeout (e.g. if Create/Update Timeout > 5mins use async) and still allow provider developers to override the default behavior. We can generate For example, AWS RDS cluster has a default of 120 mins and we could choose async mode by default. |
@turkenh I think that's a great idea and we can use that information to default to |
4c299fc
to
34650a8
Compare
Sounds good, here it is: #78 |
Signed-off-by: Muvaffak Onus <me@muvaf.com>
b7b25ca
to
d73c75f
Compare
process: remove unused process package client: make workspace store map safe for concurrent use client: make operation functions safe for concurrent calls client: clean up the terraform lock file if it leaked Signed-off-by: Muvaffak Onus <me@muvaf.com>
Signed-off-by: Muvaffak Onus <me@muvaf.com>
Signed-off-by: Muvaffak Onus <me@muvaf.com>
…ient interface and error reporting Signed-off-by: Muvaffak Onus <me@muvaf.com>
…suggested in documentation of sync.Map Signed-off-by: Muvaffak Onus <me@muvaf.com>
Signed-off-by: Muvaffak Onus <me@muvaf.com>
…er functions Signed-off-by: Muvaffak Onus <me@muvaf.com>
…nd return last operation error Signed-off-by: Muvaffak Onus <me@muvaf.com>
… util Signed-off-by: Muvaffak Onus <me@muvaf.com>
Signed-off-by: Muvaffak Onus <me@muvaf.com>
Signed-off-by: Muvaffak Onus <me@muvaf.com>
d73c75f
to
d53f1df
Compare
The missing tests are the tests of |
d53f1df
to
caab0e6
Compare
Signed-off-by: Muvaffak Onus <me@muvaf.com>
caab0e6
to
09cbb70
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @muvaf, great work!
Added a couple of comments, mostly nits. I am fine with merging after you check/resolve them.
"k8s.io/apimachinery/pkg/util/json" | ||
) | ||
|
||
// Observable is mock Observable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw, I happily used gomock to generate mock for the SecretClient interface in my PR: https://github.com/crossplane-contrib/terrajet/pull/77/files#diff-30423698cfc4cdfa0e53dbb2e91d41ccbadb1b70a34968349f21a7238d6d11f1R30
ctx, cancel := context.WithDeadline(context.TODO(), w.LastOperation.StartTime().Add(defaultAsyncTimeout)) | ||
go func() { | ||
defer cancel() | ||
cmd := exec.CommandContext(ctx, "terraform", "apply", "-auto-approve", "-input=false", "-lock=false", "-json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit concerned about the duplicate code in synced and async methods.
Could we refactor them into a single flow somehow to get rid of code duplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return statements and last operation checks make us not be able to de-duplicate unfortunately. We could make the exec.CommandContext
parts generic but I don't think it's worth the indirection.
…d refresh result in ongoing state Signed-off-by: Muvaffak Onus <me@muvaf.com>
869ed88
to
cda3ec3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @muvaf, lgtm.
Description of your changes
Changes the bookkeeping of the operations from
xp.lock
and.store
files to a global map of operationsWorkspaceStore
. In addition, there are a couple of things in this PR that's not directly related to bookeeping but help the flow. Main changes are as following:WorkspaceStore
to hold allWorkspace
objects that is created per CR instance, identified by their UUIDs.Apply
operation is to update the annotations to make sure we capture the server-side computed IDs as soon as possible and get enqueued.Apply
andDestroy
operations for resources whose creation take short time, letting us save the state as fast as it can be.terraform destroy
completes successfully, the resource is gone so that we don't have to implement custom logic to handle resources whose read calls find the resource even after it's actually deleted (KMS keys, some configuration CRs)ApplyAsync
again to poll its results, allowing us to use the latest crossplane-runtime.ExternalClient
implementation directly. There is only a single interface now, similar to native providers.terraform
is client package andcontroller
is the managed reconciler implementation package.This PR is mostly accumulation of our collective experience testing the controllers during implementation of features during the initial push that we released as tech preview.
Special shoutout to @ulucinar who had to deal with this complex beast in the initial rush. The state handling logic he implemented is still the core logic with only some refactoring that incorporated our understanding of the whole thing.
It's ready for review but there are TODOs before merging:
os
file calls with afero for easier testing.terraform
package.Fixes #74
Fixes #39
Fixes #22
Fixes #21
I have:
make reviewable
to ensure this PR is ready for review.backport release-x.y
labels to auto-backport this PR if necessary.How has this code been tested
I have tested it in both async and sync mode using provider-tf-aws. The example YAMLs will be there. Additionally, I tested with stopping/starting the controller and it was able to continue without any errors.
Sync:
It works just like other providers. There are a few gotchas. For example, after the deletion is completed we requeue immediately but for some reason that version of custom resource is old, hence, the finalizer removal
Update
fails but at that point we removed the directory. So, in the next reconcile we init again, check existence and then finalizer is removed successfully. Though this looks like a general problem that should be addressed in controller-runtime or crossplane-runtime. Other than that, I haven't caught an obvious problem.Async:
This one also works as expected. One thing different regarding the async mode is that we introdue a new condition to report the last operation status. This would come in the schema of the resource for other providers but since we control the client, we return it as part of RefreshResult and put it under a new condition separately.
One thing I don't love about async mode is that even if the operation takes 5 seconds due to an error, we have to wait for a full poll interval in order to be aware of the error and I couldn't find a way to manually enqueue the resource. Same thing happens with native provider async resources as well but they usually do a client side validation with their async
Create
call.I have a hunch that we might be able to get some simplification if we separate sync/async into two
ExternalClient
implementations but I held that off for now.