Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PBTS: Proposal's reception time should be persisted to the WAL #2322

Closed
3 tasks
Tracked by #1731
cason opened this issue Feb 13, 2024 · 8 comments · Fixed by #2388
Closed
3 tasks
Tracked by #1731

PBTS: Proposal's reception time should be persisted to the WAL #2322

cason opened this issue Feb 13, 2024 · 8 comments · Fixed by #2388
Assignees
Labels
consensus pbts protobuf Protocol buffers-related

Comments

@cason
Copy link
Collaborator

cason commented Feb 13, 2024

Originally: tendermint/tendermint#8954

For implementing PBTS mechanism, in particular for registering the reception time of Proposal message, the consensus reactor registers the ReceiveTime of every message in the msgInfo instance delivered to the consensus protocol.

Messages received and accepted by the consensus protocol are written to the WAL, allowing nodes to recover from crashes. However, the msgInfo proto, that is written to the WAL, does not include the ReceiveTime field.

As a result, when replaying messages from the WAL their ReceiveTime is nil (zero time), which leads nodes to reject a replayed Proposal message upon recovery, even though the same Proposal message was accepted during regular execution. This leads recovering validator nodes to equivocate.

@sergio-mena has observed this behavior in e2e tests and reported it in detail in tendermint/tendermint#8739. He also proposed a workaround to avoid tests from failing (tendermint/tendermint#8774), by disabling the PBTS timely verification of the Proposal's timestamp when in replay mode. The problem of this approach is that it can also lead to equivocating behavior: if the Proposal's timestamp is rejected by PBTS during regular execution, then upon recovery the same Proposal can be accepted, as the timely verification is disabled.

The straightforward solution for this issue is to write the ReceiveTime field of msgInfo struct to the WAL during regular operation, and to read the same field from the WAL when replaying messages, so that the protocol processes Proposal messages using the same temporal information employed before, during regular operation.

The downside of this solution is to increase the size and the amount of the information written to the WAL, with a potential performance impact. The byte-size of a timestamp is typically 13 bytes, but the main point here is that the ReceiveTime is not relevant for most of the received and broadcast messages. In fact, this field is only employed for Proposal messages. For all the other messages, having it set to zero (nil) during recovery does not constitute a problem.

DoD

@cason cason added consensus protobuf Protocol buffers-related pbts labels Feb 13, 2024
@cason cason added this to CometBFT Feb 13, 2024
@github-project-automation github-project-automation bot moved this to Todo in CometBFT Feb 13, 2024
@cason
Copy link
Collaborator Author

cason commented Feb 13, 2024

The behaviour of the TestWALCrash consensus unit test reflects this issue, see: #2320 (comment)

@lasarojc lasarojc moved this from Todo to In Progress in CometBFT Feb 20, 2024
@lasarojc lasarojc self-assigned this Feb 20, 2024
@lasarojc
Copy link
Contributor

We could add the timestamp as optional to protobuf and only fill it for proposal messages. This should reduce the impact of the field on the WAL.

@sergio-mena
Copy link
Contributor

sergio-mena commented Feb 21, 2024

Another thought: as we are not activating PBTS upon upgrade, but via the "enable height" in the future, we don't need to consider backwards compatibility of the proposal format in the WAL (I'm happy we adopted that way of activating PBTS).

We need to consider, tho, that non-PBTS heights may be missing the optional field also in the proposal (which I don't think it is a problem, as BFT Time does not need it).

@lasarojc
Copy link
Contributor

PR #2388 improves test TestWALEncoderDecoder, which fails if applied to feature/pbts as follows (excerpt):

        	Error Trace:	/Users/lasaro/code/informal/cometbft/internal/consensus/wal_test.go:135
        	Error:      	Not equal: 
        	            	expected: consensus.msgInfo{Msg:(*consensus.ProposalMessage)(0x140000e2310), PeerID:"Nobody", ReceiveTime:time.Date(2024, time.February, 21, 12, 24, 11, 230093000, time.UTC)}
        	            	actual  : consensus.msgInfo{Msg:(*consensus.ProposalMessage)(0x140000e2378), PeerID:"Nobody", ReceiveTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC)}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -35,4 +35,4 @@
        	            	  ReceiveTime: (time.Time) {
        	            	-  wall: (uint64) 230093000,
        	            	-  ext: (int64) 63844115051,
        	            	+  wall: (uint64) 0,
        	            	+  ext: (int64) 0,
        	            	   loc: (*time.Location)(<nil>)
        	Test:       	TestWALEncoderDecoder

The test passes on #2388

@cason
Copy link
Collaborator Author

cason commented Feb 22, 2024

We need to consider, tho, that non-PBTS heights may be missing the optional field also in the proposal (which I don't think it is a problem, as BFT Time does not need it).

If we can indeed make this field option, we are good here. For BFT Time this field is irrelevant, for PBTS, if it was not saved is because we were not using PBTS or we are an incorrect node.

lasarojc added a commit that referenced this issue Feb 27, 2024
Closes #2322 

---

#### PR checklist

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Sergio Mena <sergio@informal.systems>
@cason
Copy link
Collaborator Author

cason commented Feb 27, 2024

Is this solved?

@sergio-mena
Copy link
Contributor

@cason Yes, the description of #2388 says it closes this issue, and it's merged already. I think it wasn't closed automatically because the PR is not on main. Closing it

@sergio-mena
Copy link
Contributor

Closed by #2388

@github-project-automation github-project-automation bot moved this from In Progress to Done in CometBFT Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consensus pbts protobuf Protocol buffers-related
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants