Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

security/p2p: prevent peers who errored being added to the peer_set #9500

Merged
merged 9 commits into from
Oct 6, 2022

Conversation

jmalicevic
Copy link
Contributor

@jmalicevic jmalicevic commented Sep 30, 2022

This work is a fix for a bug in the P2P layer.

A node can be attacked via the p2p layer by saturating its incoming connection slots and not allowing the node to accept new conditions. This happens when an attacker continuously submits requests to connect with an erroneous message causing the incoming request to error before it has been accepted. The attacked node, tries to remove the peer from its peer set which silently fails (due to the peer not yet being in the peer set). The routine adding a peer into the peer set happens in parallel in the background and will add the peer after the error has been reported.

This fix resolves the issue in the following way:

  1. We add a field removalAttemptFailed to the Peer datastructure.
  2. If removal of this peer fails, we set it to true.
  3. When adding a peer into the peer set, the Add function will return an ErrPeerRemoval error if this field was true and not add the peer.

Note. This attack does not work if the config flag allow_duplicate_ips is set to false.


PR checklist

  • Tests written/updated, or no tests needed
  • CHANGELOG_PENDING.md updated, or no changelog entry needed
  • Updated relevant documentation (docs/) and code comments, or no
    documentation updates needed

@jmalicevic jmalicevic added C:p2p Component: P2P pkg T:security Type: Security (specify priority) priority A high-priority, high-level item to be shown on the priorities project board labels Sep 30, 2022
@jmalicevic jmalicevic self-assigned this Sep 30, 2022
@jmalicevic jmalicevic requested a review from a team September 30, 2022 07:20
@jmalicevic jmalicevic changed the title security/p2p: detect peers silently erroring and being added as peers after the fact security/p2p: prevent peers who errored being added to the peer_set Sep 30, 2022
p2p/switch.go Outdated Show resolved Hide resolved
p2p/switch.go Outdated Show resolved Hide resolved
Comment on lines +50 to +52
if peer.GetRemovalFailed() {
return ErrPeerRemoval{}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do this check before calling PeerSet.Add here. I would prefer to do it on line 813 of switch.go before initializing the peer to the individual reactors

Copy link
Contributor

@thanethomson thanethomson Sep 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will reordering this as you've recommended @cmwaters not change the effectiveness of the fix?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by effectiveness? As in it will no longer fix it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this check happens anywhere earlier we introduce the exact same race condition we already heaving.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can absolutely do it before and if this flag happens to be set we don't do anything but we also do have to do it here as well to be sure we are not in the same position as w/o the fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this check happens anywhere earlier we introduce the exact same race condition we already heaving.

I see. I wasn't aware of that. Need to look in more closely then

@thanethomson thanethomson added S:backport-to-v0.34.x Tell mergify to backport the PR to v0.34.x S:backport-to-v0.37.x Tell mergify to backport the PR to v0.37.x and removed priority A high-priority, high-level item to be shown on the priorities project board labels Sep 30, 2022
jmalicevic and others added 2 commits September 30, 2022 13:02
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
p2p/switch.go Outdated
// This is done to avoid the corner case of adding a peer when this error happens before the
// AddPeer routine has finished. Setting this flag will prevent a peer from being added
// to a node's peer set.
peer.SetRemovalFailed()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to do this here without the peer set being locked in the same way it is when calling GetRemovalFailed here? (the GetRemovalFailed call happens while the peer set's mutex is locked)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had intuitively implemented with a lock in a previous version (although by setting a lock on the peer itself, not on the entire set). But I did not notice the attack being more successful so I opted to not add a mutex into the peer structure itself even though I still do a small window of opportunity for the removal to happen first and then the Add to grab a lock and add the peer before we set the removal failed flag. Thinking about it now, I could actually set the flag inside the Remove function above the line that returns false and not do a lock on the peer itself. I will do a local test with this change to examine the behaviour unless you have objections.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on a local test I think it is better to put it inside the Remove. Will confirm on the testnet before pushing the change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not change the behaviour much but is definetly more correct for this to be inside Remove. I moved it there now. Tested both locally and on DO.

Copy link
Contributor

@thanethomson thanethomson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say we should go ahead and merge this. Next steps from here:

  • Backport to v0.37.x and v0.34.x, and add a CHANGELOG_PENDING.md entry to each backport PR.
  • Run a large-scale testnet (a single test) on v0.34.x with this fix and check that things work as expected.
  • Cut v0.34.22.
  • Keep the v0.37.x backport branch open until we've cut v0.37.0, then merge it.
  • Run another large-scale testnet on v0.37.x to check that things work before cutting v0.37.1.

Unless anyone has any objections there?

@jmalicevic
Copy link
Contributor Author

Thanks! I'll create an issue with the items you listed, I think they all make sense are definitely worth doing. Do you think we should do a large testnet test of whether the attacked is prevented? If so, we can add this item to the list and I'll need to patch the testnet scripts to be able to execute the attack.

@thanethomson
Copy link
Contributor

Do you think we should do a large testnet test of whether the attacked is prevented? If so, we can add this item to the list and I'll need to patch the testnet scripts to be able to execute the attack.

The main thing we'd need to know is whether the P2P network is still stable and that the fix doesn't meaningfully affect peer connectivity. I don't think it needs to be a long test. I assume that your previous tests showed that the fix worked in larger-scale testnets?

@jmalicevic
Copy link
Contributor Author

I assume that your previous tests showed that the fix worked in larger-scale testnets?

I tested that a more aggressive attack does not bring the node down and that the attack is prevented. I did not run a 200node test with the attack. I also don;t think it would reveal additional information. But it was worth testing that when we multiple the attackers the node is still behaving correctly. So I agree with you, we should confirm that this fix does not break something else rather then attack a node within a bigger network.

@jmalicevic jmalicevic merged commit c0bdb24 into main Oct 6, 2022
@jmalicevic jmalicevic deleted the jasmina/p2p-bugfix branch October 6, 2022 07:02
mergify bot pushed a commit that referenced this pull request Oct 6, 2022
…9500)

* Mark failed removal of peer to address security bug

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
(cherry picked from commit c0bdb24)
mergify bot pushed a commit that referenced this pull request Oct 6, 2022
…9500)

* Mark failed removal of peer to address security bug

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
(cherry picked from commit c0bdb24)
jmalicevic added a commit that referenced this pull request Oct 7, 2022
…backport #9500) (#9516)

* security/p2p: prevent peers who errored being added to the peer_set (#9500)

* Mark failed removal of peer to address security bug

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
(cherry picked from commit c0bdb24)

* Changelong entry and added missing functions for implementations of Peer

Co-authored-by: Jasmina Malicevic <jasmina.dustinac@gmail.com>
jmalicevic added a commit that referenced this pull request Oct 10, 2022
@cason
Copy link
Contributor

cason commented Oct 11, 2022

This is an elegant workaround. As Jasmina mentioned, while I agree we should assess the p2p layer as a whole in large testnet, I don't see how a larger environment would affect this form of attack or compromise the solution.

Olshansk added a commit to pokt-network/tendermint that referenced this pull request Apr 5, 2023
)

**tl;dr DOS mitigation migrated from [tendermint/tendermint/pull/9500](tendermint/tendermint#9500

Validated use LocalNet instructions at [doc/guides/localnet.md](https://github.com/pokt-network/pocket-core/blob/staging/doc/guides/localnet.md)

Original PR description:

```
This work is a fix for a bug in the P2P layer.

A node can be attacked via the p2p layer by saturating its incoming connection slots and not allowing the node to accept new conditions. This happens when an attacker continuously submits requests to connect with an erroneous message causing the incoming request to error before it has been accepted. The attacked node, tries to remove the peer from its peer set which silently fails (due to the peer not yet being in the peer set). The routine adding a peer into the peer set happens in parallel in the background and will add the peer after the error has been reported.

This fix resolves the issue in the following way:

We add a field removalAttemptFailed to the Peer datastructure.
If removal of this peer fails, we set it to true.
When adding a peer into the peer set, the Add function will return an ErrPeerRemoval error if this field was true and not add the peer.
Note. This attack does not work if the config flag allow_duplicate_ips is set to false.
```
adrianbrink pushed a commit to heliaxdev/tendermint that referenced this pull request May 23, 2023
…216)

* security/p2p: prevent peers who errored being added to the peer_set (backport tendermint#9500) (#87)

* security/p2p: prevent peers who errored being added to the peer_set (tendermint#9500)

* Mark failed removal of peer to address security bug

* Added p2p bugfix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C:p2p Component: P2P pkg S:backport-to-v0.34.x Tell mergify to backport the PR to v0.34.x S:backport-to-v0.37.x Tell mergify to backport the PR to v0.37.x T:security Type: Security (specify priority)
Projects
Status: Done/Merged
Development

Successfully merging this pull request may close these issues.

4 participants