Memory leaks #26130

Closed
RealFatCat opened this issue Aug 4, 2022 · 3 comments
Labels
provider Pertains to the provider itself, rather than any interaction with AWS. service/ec2 Issues and PRs that pertain to the ec2 service. service/vpc Issues and PRs that pertain to the vpc service. stale Old or inactive issues managed by automation, if no further action taken these will get closed.

Comments

@RealFatCat

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

$ terraform --version
Terraform v1.2.6
on linux_amd64
terraform-provider-aws_v4.24.0_x5

Affected Resource(s)

We create lots of resources in order to create EC2 instances, EKS clusters, etc.
So, for an EC2 instance, for example:

  • aws_instance
  • aws_internet_gateway
  • aws_network_interface
  • aws_route
  • aws_route_table
  • aws_route_table_association
  • aws_security_group
  • aws_security_group_rule
  • aws_subnet
  • aws_vpc

Expected Behavior

No memory leaks.

Actual Behavior

Memory leaks.

Steps to Reproduce

The provider should be running.
Constantly run commands like:
terraform apply -refresh-only -auto-approve -input=false -lock=false -json
terraform plan -refresh=false -input=false -lock=false -json
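
For illustration, a minimal Go driver that keeps re-running these commands against the already-running provider could look like the sketch below. This is a hypothetical stand-in for crossplane's reconciler, not what we actually run; the workspace path and the 30-second pause are assumptions.

```go
// repro.go - hypothetical driver that mimics a reconciler by re-running the
// same Terraform commands in a loop against an already-running provider.
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	commands := [][]string{
		{"terraform", "apply", "-refresh-only", "-auto-approve", "-input=false", "-lock=false", "-json"},
		{"terraform", "plan", "-refresh=false", "-input=false", "-lock=false", "-json"},
	}

	for {
		for _, args := range commands {
			cmd := exec.Command(args[0], args[1:]...)
			cmd.Dir = "./workspace" // assumed: path to a Terraform working directory
			if out, err := cmd.CombinedOutput(); err != nil {
				log.Printf("%v failed: %v\n%s", args, err, out)
			}
		}
		time.Sleep(30 * time.Second) // arbitrary pause between iterations
	}
}
```

With the provider kept alive across iterations like this, its memory usage can then be watched via /proc/<pid>/status as shown further down.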

Important Factoids

First of all, I'm not quite sure that this is the right place for this issue, but I'll try to explain what we are doing.

We run the crossplane provider-jet-aws in a k8s cluster.
tl;dr: the crossplane-provider transforms k8s manifests of AWS resources into Terraform configs and applies them.

In the pod, the crossplane-provider starts terraform-provider-aws.
After that, the crossplane-provider just runs Terraform CLI commands, like terraform init, terraform plan, terraform apply, etc.

So, terraform-provider-aws is constantly running.

After some time (4-5 hours), the pod is killed with OOM, and the cause is terraform-provider-aws.

For example, here is the current ps aux in the pod:

$ ps aux
PID   USER     TIME  COMMAND
    1 1001     16:25 crossplane-provider -d -s 30s --terraform-version 1.2.6 --terraform-provider-version 4.24.0 --terraform-provider-source hashicorp/aws --max-reconcile-rate 200 --leader-election
15793 1001      0:00 sh
26435 1001      0:00 sh
26522 1001      0:00 sh
26696 1001      0:01 terraform apply -refresh-only -auto-approve -input=false -lock=false -json
26708 1001      0:01 terraform apply -refresh-only -auto-approve -input=false -lock=false -json
26721 1001      0:01 terraform apply -refresh-only -auto-approve -input=false -lock=false -json
26726 1001      0:01 terraform apply -refresh-only -auto-approve -input=false -lock=false -json
26744 1001      0:00 terraform plan -refresh=false -input=false -lock=false -json
26758 1001      0:00 terraform plan -refresh=false -input=false -lock=false -json
26768 1001      0:00 terraform apply -refresh-only -auto-approve -input=false -lock=false -json
26782 1001      0:00 terraform apply -refresh-only -auto-approve -input=false -lock=false -json
26790 1001      0:00 terraform apply -refresh-only -auto-approve -input=false -lock=false -json
26808 1001      0:00 terraform apply -refresh-only -auto-approve -input=false -lock=false -json
26820 1001      0:00 ps aux
28712 1001      2h25 /terraform/provider-mirror/registry.terraform.io/hashicorp/aws/4.24.0/linux_amd64/terraform-provider-aws_v4.24.0_x5

Judging by the PID of terraform-provider-aws, something has already gone wrong and the process has been restarted by the crossplane-provider.

Anyway, here is the RssAnon of terraform-provider-aws:

$ cat /proc/28712/status | grep -i rssanon
RssAnon:	 6974484 kB

Also, I've managed to run terraform-provider-aws with pprof, so here are some files.

Attached files: profile001, heap.gz
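
For reference, a saved heap profile like heap.gz can be opened with go tool pprof heap.gz. The sketch below shows the standard net/http/pprof way of exposing such profiles from a long-running Go process; it only illustrates the mechanism and is not the actual patch used to instrument terraform-provider-aws (the port is arbitrary).

```go
// pprof_example.go - minimal sketch of exposing Go's built-in pprof endpoints
// in a long-running process.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Heap profiles can then be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the process's real work
}
```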

I've found some issues about memory leaks in the grpc-go repo, but they are all closed, so I decided to post here.

Thanks in advance for any help.

@github-actions github-actions bot added needs-triage Waiting for first response or review from a maintainer. service/ec2 Issues and PRs that pertain to the ec2 service. service/vpc Issues and PRs that pertain to the vpc service. labels Aug 4, 2022
@justinretzolk justinretzolk added provider Pertains to the provider itself, rather than any interaction with AWS. and removed needs-triage Waiting for first response or review from a maintainer. labels Aug 30, 2022
@RealFatCat
Author

OK, it seems I understand where these leaks come from.

Crossplane runs terraform-provider-aws in test mode to keep it running.
This means that c.process is not set and its value is nil here:
https://github.com/hashicorp/go-plugin/blob/master/client.go#L860

When client.Kill() runs, it almost immediately checks c.process and, in our case, just returns:
https://github.com/hashicorp/go-plugin/blob/master/client.go#L414

So, connections are not closed => memory leaks in the provider.
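
Paraphrasing the linked go-plugin code, the shape of the problem is roughly the following. This is a simplified, hypothetical sketch; the field and method names are illustrative, not copied from hashicorp/go-plugin.

```go
package main

import (
	"io"
	"os"
)

// Simplified sketch of the behaviour described above.
type client struct {
	process *os.Process // nil when the plugin runs in debug/test mode
	conn    io.Closer   // connection to the plugin
}

func (c *client) Kill() {
	if c.process == nil {
		// Debug/test mode: there is no child process to kill, so Kill()
		// returns here and the connection below is never closed, so it
		// accumulates on the provider side.
		return
	}
	_ = c.conn.Close()
	_ = c.process.Kill()
}

func main() {
	c := &client{process: nil}
	c.Kill() // returns immediately; no teardown happens
}
```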

The easy way to reproduce is to run terraform-provider-aws --debug and constantly run terraform plan against the provider in debug mode.

github-actions bot commented

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 30 days it will automatically be closed. Maintainers can also remove the stale label.

If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thank you!

@github-actions github-actions bot added the stale Old or inactive issues managed by automation, if no further action taken these will get closed. label Sep 29, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 3, 2024

github-actions bot commented Dec 5, 2024

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 5, 2024