
Fix peer down-scoring behaviour when gossip blobs/columns are received after getBlobs or reconstruction #6686

Merged
Merged 1 commit on Dec 16, 2024

Conversation

jimmygchen
Member

@jimmygchen jimmygchen commented Dec 12, 2024

Issue Addressed

Fixed a peer disconnection issue recently reported by @hangleang.

On the PeerDAS devnet, he noticed that the proposer's Grandine node gets disconnected by Lighthouse after publishing the columns. Logs show that Lighthouse performed reconstruction after receiving 50% (64 of 128) of the data columns, and ignored the remaining columns received from gossip. Lighthouse also penalised the peer for each of these ignored columns. Even though each penalty is only a HighToleranceError, the number of messages that can arrive after reconstruction is large (50%, or up to 64 columns per slot), so the node got disconnected almost immediately. He also noticed that this happened not only with Grandine peers, but with Lighthouse peers as well.

Logs:

Nov 19 04:28:44.943 DEBG On-time head block                      set_as_head_time_ms: 7, imported_time_ms: 442, attestable_delay_ms: 1493, available_delay_ms: 1493, execution_time_ms: 3, consensus_time_ms: 65, blob_delay_ms: 1493, observed_delay_ms: 1338, total_delay_ms: 1942, slot: 40, proposer_index: 181, block_root: 0x740b40a44a9611ffddf53c1e7ffbaa88d21592e885e877fd14615e12e3acb426, service: beacon
Nov 19 04:28:44.944 DEBG Issuing engine_forkchoiceUpdated        current_slot: 40, head_block_root: 0x740b40a44a9611ffddf53c1e7ffbaa88d21592e885e877fd14615e12e3acb426, head_block_hash: 0x756e98dc89ccbab138b899d903513299899b5b4a0dbe828413a18e82d3621b76, justified_block_hash: 0x0000000000000000000000000000000000000000000000000000000000000000, finalized_block_hash: 0x0000000000000000000000000000000000000000000000000000000000000000, service: exec
Nov 19 04:28:44.944 DEBG No change in canonical head             head: 0x740b40a44a9611ffddf53c1e7ffbaa88d21592e885e877fd14615e12e3acb426, service: beacon
Nov 19 04:28:44.944 DEBG No change in canonical head             head: 0x740b40a44a9611ffddf53c1e7ffbaa88d21592e885e877fd14615e12e3acb426, service: beacon
Nov 19 04:28:44.944 DEBG Could not verify column sidecar for gossip. Ignoring the column sidecar, index: 117, block_root: 0x740b40a44a9611ffddf53c1e7ffbaa88d21592e885e877fd14615e12e3acb426, slot: 40, error: PriorKnown { proposer: 181, slot: Slot(40), index: 117 }
Nov 19 04:28:44.944 DEBG Peer score adjusted                     score: -5.99, peer_id: 16Uiu2HAmVgsM5dYcu4WL4ww2j3Wp8V5vBNCmkgWRDnPUbPC9CKhh, msg: gossip_data_column_high, service: libp2p

We should not penalise a peer for sending us valid blobs / data columns on time. This is less obvious in Deneb due to the lower message count, but gets worse in PeerDAS, which has a higher message count for columns plus reconstruction.
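The arithmetic behind the disconnect can be sketched as follows. The per-message penalty and disconnect threshold below are invented for illustration; they are not Lighthouse's actual scoring parameters.

```rust
// Back-of-the-envelope illustration of the disconnect: many small
// "high tolerance" penalties accumulate into a ban. The penalty and
// threshold values are hypothetical, not Lighthouse's real parameters.
fn messages_to_disconnect(penalty_per_message: i32, disconnect_threshold: i32) -> u32 {
    let mut score = 0;
    let mut messages = 0;
    while score > disconnect_threshold {
        score += penalty_per_message;
        messages += 1;
    }
    messages
}

fn main() {
    // With an assumed -1 unit per ignored duplicate and a -50 unit
    // disconnect threshold, 50 ignored columns are enough to lose a peer,
    // well within the up-to-64 duplicates a single slot can produce.
    println!("{}", messages_to_disconnect(-1, -50));
}
```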

Proposed Changes

Remove the peer penalty for blobs and columns that are received from gossip for the first time but are already available via other channels, such as getBlobs, supernode column reconstruction, or RPC.
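A minimal sketch of the proposed change, with invented type names (`GossipVerificationOutcome`, `PeerAction`) standing in for Lighthouse's actual gossip verification types: a message that is already known via another channel is dropped without adjusting the sender's score.

```rust
// Sketch of the behaviour change. Type and function names are invented
// for illustration; they are not Lighthouse's real types.
#[derive(Debug, PartialEq)]
enum GossipVerificationOutcome {
    Valid,
    // Already available via getBlobs, column reconstruction, or RPC.
    PriorKnown,
    Invalid,
}

#[derive(Debug, PartialEq)]
enum PeerAction {
    Accept,
    IgnoreWithoutPenalty,
    Penalize,
}

fn action_for(outcome: &GossipVerificationOutcome) -> PeerAction {
    match outcome {
        GossipVerificationOutcome::Valid => PeerAction::Accept,
        // Before this change, PriorKnown incurred a high-tolerance penalty;
        // now the duplicate is simply dropped with no score adjustment.
        GossipVerificationOutcome::PriorKnown => PeerAction::IgnoreWithoutPenalty,
        GossipVerificationOutcome::Invalid => PeerAction::Penalize,
    }
}

fn main() {
    let outcome = GossipVerificationOutcome::PriorKnown;
    println!("{:?}", action_for(&outcome));
}
```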

Note that duplicates received after the first message are filtered by gossipsub. We only send IDONTWANT when publishing columns, and because we publish gradually, IDONTWANT may not be sent to peers immediately. This is probably something to consider: send IDONTWANT immediately but still publish gradually, to avoid excessive bandwidth usage.
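The idea in the note above could be sketched like this. All names here are invented; a real implementation would go through the libp2p gossipsub behaviour, and IDONTWANT is only honoured by peers speaking gossipsub v1.2.

```rust
// Sketch: announce IDONTWANT for every column id up front, then trickle
// out the actual column publications in batches to cap bandwidth.
// `Publisher` and its fields are hypothetical, for illustration only.
#[derive(Default)]
struct Publisher {
    idontwant_sent: Vec<String>, // message ids announced to mesh peers
    published: Vec<String>,      // columns actually pushed to the mesh
}

impl Publisher {
    fn publish_columns(&mut self, column_ids: &[String], batch_size: usize) {
        // 1. Announce all ids immediately so mesh peers stop forwarding
        //    these columns back to us while we are still publishing.
        self.idontwant_sent.extend(column_ids.iter().cloned());

        // 2. Publish gradually, batch by batch; a real node would insert
        //    a delay between batches.
        for batch in column_ids.chunks(batch_size) {
            self.published.extend(batch.iter().cloned());
        }
    }
}

fn main() {
    let ids: Vec<String> = (0..4).map(|i| format!("column-{i}")).collect();
    let mut p = Publisher::default();
    p.publish_columns(&ids, 2);
    println!("{} {}", p.idontwant_sent.len(), p.published.len());
}
```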

Thanks @hangleang for reporting this!

…hey are received from the EL or available via column reconstruction.
@jimmygchen jimmygchen added ready-for-review The code is ready for review das Data Availability Sampling labels Dec 12, 2024
@michaelsproul michaelsproul added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Dec 16, 2024
@jimmygchen
Member Author

@mergify queue


mergify bot commented Dec 16, 2024

queue

🛑 The pull request has been removed from the queue default

Pull request #6686 has been dequeued by a dequeue command.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@jimmygchen
Member Author

@mergify dequeue


mergify bot commented Dec 16, 2024

This pull request has been removed from the queue for the following reason: pull request dequeued.

Pull request #6686 has been dequeued by a dequeue command.

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue


mergify bot commented Dec 16, 2024

dequeue

✅ The pull request has been removed from the queue default

@jimmygchen
Member Author

@mergify requeue


mergify bot commented Dec 16, 2024

requeue

✅ This pull request will be re-embarked automatically

The followup queue command will be automatically executed to re-embark the pull request


mergify bot commented Dec 16, 2024

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 847c801
