Trace and debug logging is extremely expensive (x32 slower) for large resource diffs #721
There are a couple of possible fixes that come to mind, but I'd defer to people more familiar with the project on this.
This commit addresses this issue: hashicorp#721. When working with large Terraform resource attributes, logging can become the primary cost of running plan/apply. This is because the context logger used by tfsdklog is very expensive, the SDK framework itself is very verbose, and even when the log level means Debug/Trace logs won't be emitted, we still build the log. As we log via the internal logging package anyway, we can avoid this performance issue by maintaining a package-local log level which is initialised at the same time as we init the contextual logger. Our logging package can then avoid calling into tfsdklog (which is where we pay the performance penalty) unless it believes the level is active, which in most cases avoids 99% of the log cost.
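For illustration only, a guard of that shape might look roughly like the sketch below. The package, constant, and function names here are assumptions made for this example, not the actual forked framework code; the idea is simply that the level is read once at initialisation and checked cheaply on every log call.

```go
// Hypothetical sketch of a package-local level guard. The level is read once
// when the contextual logger is initialised, and every log helper checks it
// before paying the cost of building fields and calling into tfsdklog.
package logging

import (
	"os"
	"strings"
	"sync/atomic"
)

// Levels ordered from most to least verbose.
const (
	levelTrace int32 = iota
	levelDebug
	levelInfo
	levelWarn
	levelError
)

// currentLevel is set once at logger initialisation and read on every log
// call, so an atomic is enough; no lock is needed on the hot path.
var currentLevel atomic.Int32

// InitLevel mirrors the environment-variable driven configuration (e.g. a
// TF_LOG-style variable) that the SDK's logging setup already reads.
func InitLevel() {
	switch strings.ToUpper(os.Getenv("TF_LOG")) {
	case "TRACE":
		currentLevel.Store(levelTrace)
	case "DEBUG":
		currentLevel.Store(levelDebug)
	default:
		currentLevel.Store(levelInfo)
	}
}

// traceEnabled is the cheap guard: callers skip building log fields and never
// touch the context-based logger unless trace would actually be emitted.
func traceEnabled() bool {
	return currentLevel.Load() <= levelTrace
}
```

At each call site, the logging helpers would then early-return (for example, `if !traceEnabled() { return }` at the top of each trace helper) before building fields or touching the context-based logger.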
This PR avoids the issue when trace/debug levels are not active (most production situations) by maintaining a log-level guard inside the framework code: #722. We're going to run with this for now, but I'm very interested in people's thoughts more generally. If I had more time, I'd probably be inclined to (1) get this merged and (2) investigate tfsdklog's performance later, as I expect something weird is going on around context nesting and key length that makes usage really expensive.
Introduce a resource that can manage entries in bulk, which:
1. Solves Terraform performance issues when dealing with a lot of entries
2. Reduces pressure on our API when working with entries

To achieve this we've forked the Terraform SDK framework to improve performance when logging, which is necessary to avoid massive plan times on large nested objects. Relevant links are:
- PR for logging change (hashicorp/terraform-plugin-framework#722)
- Issue in upstream (hashicorp/terraform-plugin-framework#721)
Reference: hashicorp/terraform-plugin-framework#721

This change introduces log level checks outside the `context.Context` of a request to improve the performance of logging functions when logs would not be emitted at the configured level. Almost exclusively, logging levels are not expected to change during provider (and therefore SDK) runtime since they are environment variable driven. Even if they did, this level check will either immediately constrict logging to match an updated higher level or cause additional `context.Context` reads with an updated lower level, which is no different from the current behavior.

The following benchmark was run prior to the introduction of the level checks and mutexes:

```
BenchmarkSubsystemTraceDisabled100-10      4616656    258.0 ns/op
BenchmarkSubsystemTraceEnabled100-10        936865     1138 ns/op
BenchmarkSubsystemTraceDisabled1000-10     4552549    261.1 ns/op
BenchmarkSubsystemTraceEnabled1000-10       956137     1126 ns/op
BenchmarkSubsystemTraceDisabled10000-10    4475106    261.4 ns/op
BenchmarkSubsystemTraceEnabled10000-10      943819     1121 ns/op
```

After the addition of level checks and mutexes:

```
BenchmarkSubsystemTraceDisabled100-10     86043471    13.78 ns/op
BenchmarkSubsystemTraceEnabled100-10        906649     1137 ns/op
BenchmarkSubsystemTraceDisabled1000-10    83207169    13.81 ns/op
BenchmarkSubsystemTraceEnabled1000-10       935115     1122 ns/op
BenchmarkSubsystemTraceDisabled10000-10   83775966    13.77 ns/op
BenchmarkSubsystemTraceEnabled10000-10      909601     1134 ns/op
```

This enhancement could also be considered for provider loggers; however, SDK logging tends to be more prevalent in practice, so this only targets the `tfsdklog` package handling.
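As a point of reference, the disabled-level fast path in those results can be benchmarked with a sketch along the following lines. The guard here is a stand-in written for this example, not the real terraform-plugin-log code, so the names and numbers are illustrative only.

```go
// guard_bench_test.go — illustrative only; loggingEnabled is a stand-in for
// the real level check, not terraform-plugin-log code.
package guard_test

import (
	"sync"
	"testing"
)

var (
	mu      sync.RWMutex
	enabled bool // would be derived from the configured log level
)

// loggingEnabled models the mutex-protected level check added by the change.
func loggingEnabled() bool {
	mu.RLock()
	defer mu.RUnlock()
	return enabled
}

// BenchmarkTraceDisabled measures the per-call cost when the guard says the
// message would not be emitted, i.e. the fast path in the results above.
func BenchmarkTraceDisabled(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if loggingEnabled() {
			b.Fatal("level should be disabled in this benchmark")
		}
	}
}
```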
Hi @lawrencejones 👋 Thank you for reporting this issue and sorry for the frustrating behavior. Over in hashicorp/terraform-plugin-log#149 it appears we can introduce some level caching and checks to avoid unnecessary work when logs would not be emitted.
Thanks a lot for putting this together! I'm at KubeCon this week so things are a bit hectic, but I'll be happy to try this out when I get some time. We have a Terraform codebase that is un-runnable without this type of patch, so it should be easy to confirm whether it works.
Hi again, @lawrencejones 👋 Did you happen to have an opportunity to try out the terraform-plugin-log change? No worries if not.
The framework now pulls in the upstream terraform-plugin-log changes via #744, so we can roll this out across the other SDKs as well. If we still find there are performance issues, we can determine next steps in a fresh issue. 👍
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
Hey team 👋
I was unsure whether this would come under bug or enhancement, but given that this is preventing my provider from creating a plan, I went for bug.
tl;dr: we have a resource that accepts a large and deeply nested object. Trace and debug logs produced by the SDK around setting defaults/state modifications/etc are so verbose that they cause planning to become much more expensive (16s -> 527s, so ~32x).
Having profiled the provider as it produces the plan, it appears the majority of the cost comes from the context-based logger. Logging this much is generally expensive, but it is particularly so for reflection-based runtime loggers that pull context.Value frequently, as the time adds up.
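To make that cost concrete, here is a small standalone example (not provider code) showing that `context.Value` walks the chain of wrapped contexts, so lookups for keys buried under many layers of `context.WithValue` get proportionally slower; a context-based logger pays this on every single log call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type key int

func main() {
	// Build a deeply nested context, similar in shape to a request context
	// that has been wrapped many times by middleware and loggers.
	ctx := context.Background()
	for i := 0; i < 1000; i++ {
		ctx = context.WithValue(ctx, key(i), i)
	}

	// The key stored first is deepest in the chain, so every lookup walks
	// all of the wrappers above it.
	start := time.Now()
	for i := 0; i < 100_000; i++ {
		_ = ctx.Value(key(0))
	}
	fmt.Println("deep lookups:   ", time.Since(start))

	// The key stored last is the outermost wrapper and is found immediately.
	start = time.Now()
	for i := 0; i < 100_000; i++ {
		_ = ctx.Value(key(999))
	}
	fmt.Println("shallow lookups:", time.Since(start))
}
```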
Attaching the profile below, where you can see that of the 28.52s sample, we spent 22.38s in `context.(*valueCtx).Value` as part of the logging library, and the rest in logging-adjacent functions.

Module version
Definitely running latest as I've pulled the repo locally to try investigating this.
Relevant provider source code
PlanResourceChange (https://github.com/hashicorp/terraform-plugin-framework/blob/main/internal/fwserver/server_planresourcechange.go#L44) is where we're spending all our time, which is due to the many log calls.
Applying this patch is what gives the 32x improvement in plan speed:
Terraform Configuration Files
I've got a big nested `entries` attribute on my resource. I can share the provider, but I think the only part that is probably relevant is the schema.
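Purely as a hypothetical illustration (invented attribute names, assuming the terraform-plugin-framework v1 `resource/schema` package rather than the provider's actual code), a large nested `entries` attribute might be declared along these lines:

```go
// Hypothetical shape only: attribute names are invented, and the real
// resource's schema is larger and more deeply nested than this.
package provider

import (
	"context"

	"github.com/hashicorp/terraform-plugin-framework/resource"
	"github.com/hashicorp/terraform-plugin-framework/resource/schema"
)

type entriesResource struct{}

func (r *entriesResource) Schema(ctx context.Context, req resource.SchemaRequest, resp *resource.SchemaResponse) {
	resp.Schema = schema.Schema{
		Attributes: map[string]schema.Attribute{
			// A single attribute holding many nested objects; with thousands
			// of entries, every per-attribute trace/debug line emitted during
			// planning is multiplied by the size of this collection.
			"entries": schema.ListNestedAttribute{
				Required: true,
				NestedObject: schema.NestedAttributeObject{
					Attributes: map[string]schema.Attribute{
						"name":        schema.StringAttribute{Required: true},
						"description": schema.StringAttribute{Optional: true},
						"value":       schema.StringAttribute{Optional: true},
					},
				},
			},
		},
	}
}
```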
Debug Output
Debug and trace logs are kind of the problem here, so I'll point people at the top profile output.
Expected Behavior
My plan executes in a normal time (<30s).
Actual Behavior
My plan takes so long to execute that most people would give up (30s+; it actually only finishes ~10m later).
Steps to Reproduce
Build a highly nested object with many entries and attempt a plan.
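For anyone reproducing this without access to the original provider, a throwaway generator like the sketch below can write a configuration with thousands of nested entries; the resource type and attribute names are invented placeholders to substitute with any resource that has a large nested collection attribute.

```go
// Hypothetical reproduction helper: writes a Terraform configuration with a
// large number of nested entries. "example_catalog" and the entry fields are
// placeholders, not a real resource.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	const count = 5000 // enough entries that per-attribute logging dominates plan time

	var b strings.Builder
	b.WriteString("resource \"example_catalog\" \"this\" {\n")
	b.WriteString("  entries = [\n")
	for i := 0; i < count; i++ {
		fmt.Fprintf(&b, "    { name = \"entry-%d\", description = \"generated\", value = \"v%d\" },\n", i, i)
	}
	b.WriteString("  ]\n}\n")

	if err := os.WriteFile("main.tf", []byte(b.String()), 0o644); err != nil {
		panic(err)
	}
}
```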
References
Not that I'm aware of. I did search for `logging` and `performance` in this repo and a few others, but couldn't find anything.