provider/aws: serialize SG rule access to fix race condition #3965

Merged Nov 18, 2015 (1 commit)

phinze commented Nov 18, 2015

Because aws_security_group_rule resources are an abstraction on top of
Security Groups, they must interact with the AWS Security Group APIs in
a pattern that often results in lots of parallel requests interacting
with the same security group.

We've found that this pattern can trigger race conditions resulting in
inconsistent behavior, including:

  • Rules that report as created but don't actually exist on AWS's side
  • Rules that show up in AWS but don't register as being created
    locally, resulting in follow-up attempts to authorize the rule
    failing with Duplicate errors

Here, we introduce a per-SG mutex that must be held by any security
group rule before it is allowed to interact with AWS APIs. This protects the
space between DescribeSecurityGroups and Authorize* / Revoke*
calls, ensuring that no other rules interact with the SG during that
span.
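
For readers following along, a per-key mutex of the kind described above could be sketched roughly as follows. This is an illustration only and not necessarily the code merged in this PR: the field names, `NewMutexKV`, the `countWithLock` helper, and the `sg-123` ID are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// MutexKV sketches a keyed mutex: one lock per string key
// (for example, a security group ID). Field names are assumptions.
type MutexKV struct {
	lock  sync.Mutex             // guards store
	store map[string]*sync.Mutex // one mutex per key, created lazily
}

// NewMutexKV returns an empty, ready-to-use MutexKV.
func NewMutexKV() *MutexKV {
	return &MutexKV{store: make(map[string]*sync.Mutex)}
}

// Lock blocks until the mutex for key is acquired.
func (m *MutexKV) Lock(key string) { m.get(key).Lock() }

// Unlock releases the mutex for key.
func (m *MutexKV) Unlock(key string) { m.get(key).Unlock() }

// get returns the mutex for key, creating it on first use.
func (m *MutexKV) get(key string) *sync.Mutex {
	m.lock.Lock()
	defer m.lock.Unlock()
	mu, ok := m.store[key]
	if !ok {
		mu = &sync.Mutex{}
		m.store[key] = mu
	}
	return mu
}

// countWithLock increments a shared counter from n goroutines that all
// hold the lock for the same (hypothetical) security group ID.
func countWithLock(n int) int {
	kv := NewMutexKV()
	var wg sync.WaitGroup
	counter := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			kv.Lock("sg-123")
			defer kv.Unlock("sg-123")
			counter++ // safe: only one holder of "sg-123" at a time
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(countWithLock(100)) // prints 100
}
```

In this sketch, callers working on different keys never block each other; only operations on the same key are serialized.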

The included test exposes the race by applying a security group with
lots of rules, which, based on the dependency graph, can all be handled in
parallel. This fails most of the time without the new locking behavior.
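
The general shape of such a race can be illustrated with a toy in-memory model. This is not the acceptance test from this PR and it does not call AWS: `fakeSG`, `describe`, `write`, and `applyRules` are hypothetical names, and the lost-update behavior modeled here is a simplification that may differ from AWS's real failure mode. It only shows why a read-then-write sequence on shared state needs to be serialized.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// fakeSG is a toy stand-in for a security group's rule set.
type fakeSG struct {
	mu    sync.Mutex
	rules map[string]bool
}

// describe returns a snapshot copy of the current rules.
func (g *fakeSG) describe() map[string]bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	snap := make(map[string]bool, len(g.rules))
	for r := range g.rules {
		snap[r] = true
	}
	return snap
}

// write replaces the whole rule set (a deliberately lossy toy API).
func (g *fakeSG) write(rules map[string]bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.rules = rules
}

// applyRules adds n rules in parallel and reports how many survived.
// With serialize=true, each describe+write pair runs under one lock.
func applyRules(n int, serialize bool) int {
	sg := &fakeSG{rules: map[string]bool{}}
	var sgLock sync.Mutex // stands in for the per-SG mutex
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(rule string) {
			defer wg.Done()
			if serialize {
				sgLock.Lock()
				defer sgLock.Unlock()
			}
			snap := sg.describe()
			runtime.Gosched() // widen the window between read and write
			snap[rule] = true
			sg.write(snap)
		}(fmt.Sprintf("rule-%d", i))
	}
	wg.Wait()
	return len(sg.describe())
}

func main() {
	fmt.Println("serialized:", applyRules(50, true))
	// Without the lock the count varies from run to run and can be lower.
	fmt.Println("unserialized:", applyRules(50, false))
}
```

With the lock held across the read and the write, all 50 rules are present at the end; without it, some writes can overwrite others.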

I've omitted the mutex from Read, since it is only called during the
Refresh walk when no changes are being made, meaning a bunch of parallel
DescribeSecurityGroups API calls should be consistent in that case.

// The initial use case is to let aws_security_group_rule resources serialize
// their access to individual security groups based on SG ID.
type MutexKV struct {
	sync.Mutex

Member

I normally don't embed these in public structs, as Lock() and Unlock() become public as well.

Contributor Author

Good call. Fixing
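
To make the review point concrete, here is a small sketch of the difference between embedding `sync.Mutex` and holding it in an unexported field. The type names `Embedded` and `Private` and the helper `hasLocker` are made up for the illustration and do not come from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// Embedded embeds sync.Mutex, so Lock and Unlock are promoted into its
// method set and become part of the type's public API.
type Embedded struct {
	sync.Mutex
	n int
}

// Private keeps the mutex in an unexported field, so callers outside the
// package cannot lock or unlock it directly.
type Private struct {
	mu sync.Mutex
	n  int
}

// Incr is the only way to touch n; locking stays an internal detail.
func (p *Private) Incr() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.n++
	return p.n
}

// hasLocker reports whether v exposes Lock and Unlock methods.
func hasLocker(v interface{}) bool {
	_, ok := v.(sync.Locker)
	return ok
}

func main() {
	fmt.Println(hasLocker(&Embedded{})) // true: Lock/Unlock are exposed
	fmt.Println(hasLocker(&Private{}))  // false: the mutex is hidden
}
```

Keeping the mutex unexported means any caller-facing locking has to go through methods the type chooses to offer.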

@ryanuber
Member

@phinze this looks good, just a few comments above!

phinze force-pushed the b-aws-sg-rules-v2-race branch 2 times, most recently from e0486f5 to bb0a9b4, on November 18, 2015 18:14
@mitchellh
Contributor

@phinze This looks great. I would recommend moving the MutexKV to a helper package, but otherwise 👍 Also consider some tests for it.

@phinze
Contributor Author

phinze commented Nov 18, 2015

Moved to helper w/ unit tests. 👍
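
The unit tests added in the PR are not shown in this thread. As a rough idea of what checks on a keyed mutex might look like, the sketch below verifies that a second `Lock` on the same key blocks until `Unlock`, and that different keys do not block each other. The compact `MutexKV` definition, the function names, and the timeouts are assumptions for the example; it is written as a standalone program rather than with the `testing` package so it can be run directly.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// MutexKV is a compact keyed-mutex sketch (field names are assumptions).
type MutexKV struct {
	lock  sync.Mutex
	store map[string]*sync.Mutex
}

func newMutexKV() *MutexKV { return &MutexKV{store: map[string]*sync.Mutex{}} }

func (m *MutexKV) get(key string) *sync.Mutex {
	m.lock.Lock()
	defer m.lock.Unlock()
	if _, ok := m.store[key]; !ok {
		m.store[key] = &sync.Mutex{}
	}
	return m.store[key]
}

func (m *MutexKV) Lock(key string)   { m.get(key).Lock() }
func (m *MutexKV) Unlock(key string) { m.get(key).Unlock() }

// sameKeyBlocks checks that a second Lock on a held key waits for Unlock.
func sameKeyBlocks() bool {
	kv := newMutexKV()
	kv.Lock("foo")
	acquired := make(chan struct{})
	go func() {
		kv.Lock("foo")
		close(acquired)
	}()
	select {
	case <-acquired:
		return false // got through while the key was still held
	case <-time.After(50 * time.Millisecond):
	}
	kv.Unlock("foo")
	select {
	case <-acquired:
		return true
	case <-time.After(time.Second):
		return false
	}
}

// differentKeysIndependent checks that holding one key does not block another.
func differentKeysIndependent() bool {
	kv := newMutexKV()
	kv.Lock("foo")
	defer kv.Unlock("foo")
	acquired := make(chan struct{})
	go func() {
		kv.Lock("bar")
		close(acquired)
	}()
	select {
	case <-acquired:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("same key blocks:", sameKeyBlocks())
	fmt.Println("different keys independent:", differentKeysIndependent())
}
```

Timeout-based checks like these are a pragmatic way to observe blocking, though the chosen durations are arbitrary.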

@ryanuber
Member

🚢

phinze added a commit that referenced this pull request Nov 18, 2015
provider/aws: serialize SG rule access to fix race condition
phinze merged commit a211fc3 into master Nov 18, 2015
phinze deleted the b-aws-sg-rules-v2-race branch November 18, 2015 18:47
@ghost ghost locked and limited conversation to collaborators Apr 29, 2020