Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce duplicate peer traffic for ledger data #5126

Merged
merged 44 commits into from
Feb 14, 2025

Conversation

ximinez
Copy link
Collaborator

@ximinez ximinez commented Sep 11, 2024

High Level Overview of Change

Several changes to help reduce message traffic and improve logging and visibility.

  • Suppress duplicate TMGetLedger and TMLedgerData messages, reducing the overhead of processing those messages.
  • Reduce the number of those messages sent to peers.
  • Improve logging related to those messages, as well as a few other areas.
  • Introduces a new protocol version to gate a new feature on the TMLedgerData message, which allows multiple identical requests to be replied to with one message.

These changes are organized into several commits which are organized logically separating each functional operation. They can be merged as-is, or squashed.

Context of Change

Analysis of the issue that led to #5115 identified heavy TMGetLedger request and TMLedgerData response traffic between nodes leading up to the syncing incidents. It was later determined that those messages were more a symptom of the problem, and not the root cause. However, leading up to identification of the root cause, these changes were being implemented to cut down on those messages, detect duplicates, etc. That reduction in unnecessary traffic is still valuable, so it's being included here.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance (increase or change in throughput and/or latency)

API Impact

None.

Test Plan

  1. Verify that nodes are at least as reliable at staying in sync as they are now.
  2. Verify that the indicated messages are sent less frequently by upgraded nodes.
  3. Verify that peers that both have this change send fewer ledger messages to each other.

Future Tasks

I am still working on a follow-up to #4764 that makes use of these changes and other improvements to reduce the number of requests initiated by a given node in the first place.

@ximinez ximinez added the Perf Attn Needed Attention needed from RippleX Performance Team label Sep 11, 2024
Copy link

codecov bot commented Sep 11, 2024

Codecov Report

Attention: Patch coverage is 24.34988% with 320 lines in your changes missing coverage. Please review.

Project coverage is 77.9%. Comparing base (7c9d652) to head (5438161).
Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
src/xrpld/overlay/detail/PeerImp.cpp 0.0% 210 Missing ⚠️
src/xrpld/overlay/detail/ProtocolMessage.h 0.0% 32 Missing ⚠️
src/xrpld/app/misc/NetworkOPs.cpp 21.4% 22 Missing ⚠️
src/xrpld/overlay/detail/PeerSet.cpp 20.8% 19 Missing ⚠️
src/xrpld/app/ledger/detail/InboundLedgers.cpp 67.9% 17 Missing ⚠️
src/xrpld/app/ledger/detail/InboundLedger.cpp 54.5% 5 Missing ⚠️
src/xrpld/app/ledger/InboundLedger.h 60.0% 4 Missing ⚠️
src/xrpld/app/misc/HashRouter.cpp 60.0% 4 Missing ⚠️
src/xrpld/app/ledger/detail/LedgerMaster.cpp 0.0% 3 Missing ⚠️
src/xrpld/app/misc/HashRouter.h 77.8% 2 Missing ⚠️
... and 2 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##           develop   #5126     +/-   ##
=========================================
- Coverage     78.2%   77.9%   -0.2%     
=========================================
  Files          790     791      +1     
  Lines        67653   67975    +322     
  Branches      8161    8321    +160     
=========================================
+ Hits         52879   52968     +89     
- Misses       14774   15007    +233     
Files with missing lines Coverage Δ
include/xrpl/basics/base_uint.h 96.9% <100.0%> (+<0.1%) ⬆️
include/xrpl/protocol/LedgerHeader.h 100.0% <ø> (ø)
src/xrpld/app/ledger/detail/TimeoutCounter.cpp 88.4% <100.0%> (+1.3%) ⬆️
src/xrpld/app/ledger/detail/TimeoutCounter.h 100.0% <ø> (ø)
src/xrpld/app/misc/NetworkOPs.h 100.0% <ø> (ø)
src/xrpld/overlay/Peer.h 100.0% <ø> (ø)
src/xrpld/overlay/detail/PeerImp.h 12.8% <ø> (ø)
src/xrpld/overlay/detail/ProtocolVersion.cpp 86.4% <ø> (ø)
include/xrpl/basics/CanProcess.h 95.5% <95.5%> (ø)
src/xrpld/app/consensus/RCLConsensus.cpp 65.4% <0.0%> (ø)
... and 10 more

... and 6 files with indirect coverage changes

Impacted file tree graph

@ximinez ximinez force-pushed the pr/getledger branch 3 times, most recently from 09c4156 to c69b443 Compare September 11, 2024 14:48
ximinez and others added 5 commits September 11, 2024 11:09
* Allow a retry after 30s in case of peer or network congestion.
* Addresses RIPD-1870
* (Changes levelization. That is not desirable, and will need to be
  fixed.)
* Allow a retry after 15s in case of peer or network congestion.
* Collate duplicate TMGetLedger requests:
  * The requestCookie is ignored when computing the hash, thus increasing
    the chances of detecting duplicate messages.
  * With duplicate messages, keep track of the different requestCookies
    (or lack of cookie). When work is finally done for a given request,
    send the response to all the peers that are waiting on the request,
    sending a separate message for each requestCookie.
* Addresses RIPD-1871
* Addresses RIPD-1869

---------

Co-authored-by: Valentin Balaschenko <13349202+vlntb@users.noreply.github.com>
Co-authored-by: Ed Hennis <ed@ripple.com>
* When work is done for a given TMGetLedger request, send the
  response to all the peers that are waiting on the request,
  sending one message per peer, including all the cookies and
  a "directResponse" flag indicating the data is intended for the
  sender, too.
@ximinez ximinez changed the title Reduce ledger protocol message traffic Reduce peer message traffic for ledger data Sep 11, 2024
@ximinez ximinez requested a review from mtrippled September 11, 2024 20:37
@ximinez ximinez added this to the 2.3.0 (August 2024) milestone Sep 11, 2024
@ximinez ximinez marked this pull request as ready for review September 11, 2024 20:45
insert()
{
std::unique_lock<Mutex> lock_(mtx_);
bool exists = collection_.contains(item_);
Copy link
Collaborator

@Bronek Bronek Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would avoid extra lookup

auto [_, inserted] = collection_.insert(item_);
return inserted;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed.

* Avoid an unnecessary lookup in CanProcess
* upstream/develop:
  Set version to 2.3.0-b4
  feat(SQLite): allow configurable database pragma values (5135)
  refactor: re-order PRAGMA statements (5140)
  fix(book_changes): add "validated" field and reduce RPC latency (5096)
  chore: fix typos in comments (5094)
  Set version to 2.2.3
  Update SQLite3 max_page_count to match current defaults (5114)
@vlntb vlntb assigned vlntb and unassigned vlntb Oct 1, 2024
@vlntb vlntb self-requested a review October 1, 2024 10:57
Copy link
Collaborator

@vlntb vlntb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only minor comments. Happy to approve once addressed.

include/xrpl/basics/base_uint.h Show resolved Hide resolved
return ledger;
}

JLOG(p_journal_.trace())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit]: Should this be a warn instead of trace? Not having a peer to relay the request may indicate some configuration or environment issues.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily, though. It could mean that the node has already sent the request to all it's peers. Also not that the original message is trace.

But there's something odd here. It looks like this code block was somehow duplicated! I must have messed up resolving a conflict when I rebased from master to develop. I've removed the duplicate.

@@ -2936,7 +3078,9 @@ getPeerWithLedger(
void
PeerImp::sendLedgerBase(
std::shared_ptr<Ledger const> const& ledger,
protocol::TMLedgerData& ledgerData)
protocol::TMLedgerData& ledgerData,
std::map<std::shared_ptr<Peer>, std::set<std::optional<uint64_t>>> const&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::map<std::shared_ptr<Peer>, std::set<std::optional<uint64_t>>> is mentioned in four places. It would be more readable to define an alias:
using PeerCookieMap = std::map<std::shared_ptr<Peer>, std::set<std::optional<uint64_t>>>;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! Fixed.

* Add unit tests for to_short_string(base_uint
* Remove duplicated code
* Use type aliases for cookie maps
* That's what I get for rushing to push
* upstream/develop:
  Set version to 2.4.0-b3
  Set version to 2.3.1
  Update conan in the "nix" CI jobs
  Add RPC "simulate" to execute a dry run of a transaction (5069)
  Set version to 2.3.1-rc1
  Reduce the peer charges for well-behaved peers:
* upstream/develop:
  Add deep freeze feature (XLS-77d) (5187)
* upstream/develop:
  Amendment `fixFrozenLPTokenTransfer` (5227)
  Improve git commit hash lookup (5225)
* upstream/develop:
  Updates Conan dependencies (5256)
* upstream/develop:
  fix: Do not allow creating Permissioned Domains if credentials are not enabled (5275)
  fix: issues in `simulate` RPC (5265)
@bthomee
Copy link
Collaborator

bthomee commented Feb 10, 2025

@Bronek Any outstanding issues for @ximinez to address?

* upstream/develop:
  fix: Amendment to add transaction flag checking functionality for Credentials (5250)
  fix: Omit superfluous setCurrentThreadName call in GRPCServer.cpp (5280)
case CONSENSUS:
return "CONSENSUS";
default:
assert(false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please replace with UNREACHABLE and a nice & short description why (see CONTRIBUTING.md , section " Contracts and instrumentation" for naming guideline)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please replace with UNREACHABLE and a nice & short description why (see CONTRIBUTING.md , section " Contracts and instrumentation" for naming guideline)

Done.

{
// Done. Something else (probably consensus) built the ledger
// locally while waiting for data (or possibly before requesting)
assert(isDone());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please replace with XRPL_ASSERT and a short description why; see CONTRIBUTING.md for contracts and instrumentation naming guidelines

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please replace with XRPL_ASSERT and a short description why; see CONTRIBUTING.md for contracts and instrumentation naming guidelines

Done


for (auto const peerID : s.peekPeerSet())
{
if (!callback(peerID))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How fast are these callbacks - we are running them over a loop inside critical section. Might need to think of an alternative solution, to avoid hogging the mutex_. Not a major issue, because as indicated in first sentence - I do not know if these callbacks are fast or slow (or we don't know ?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The callbacks are very fast, but now I don't remember why I wanted to do this under lock.
PeerImp.cpp line 3528

        std::set<HashRouter::PeerShortID> peers;
        app_.getHashRouter().forEachPeer(
            mHash, [&](HashRouter::PeerShortID const& peerID) {
                // callback is called under lock, so finish as fast as
                // possible.
                peers.insert(peerID);
                return true;
            });

Maybe I wanted avoid making a copy of the std::set? But this is effectively making a copy of the set anyway, so that's dumb. I'd guess I originally wrote it with the loop contents inside the callback, but decided that was too slow, and didn't think before just copying.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 10 assert added here, can you please turn them to XRPL_ASSERT, following naming guideline in CONTRIBUTING.md ?

Copy link
Collaborator

@Bronek Bronek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only minor remarks related to use of assert, once fixed feel free to merge

Note to self: add unit tests for PeerImp.cpp in a future PR

@Bronek Bronek self-requested a review February 11, 2025 18:19
Copy link
Collaborator

@Bronek Bronek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I realized I added a comment re. use of a loop and callback inside critical section. A sufficient explanation why this is safe might be sufficient to address this comment, but for now this is not approved.

@Bronek
Copy link
Collaborator

Bronek commented Feb 11, 2025

@Bronek Any outstanding issues for @ximinez to address?

Yes, just added some. The loop and callback inside critical section can be probably resolved with an explanation e.g. a comment in code. The assert ones are trivial to fix.

* upstream/develop:
  chore: Rename missing-commits job, and combine nix job files (5268)
  docs: Add a summary of the git commit message rules (5283)
* upstream/develop:
  fix: Replace charge() by fee_.update() in OnMessage functions (5269)
  docs: ensure build_type and CMAKE_BUILD_TYPE match (5274)
  chore: Fix small typos in protocol files (5279)
@ximinez
Copy link
Collaborator Author

ximinez commented Feb 13, 2025

Commit message:

Reduce duplicate peer traffic for ledger data (#5126)

- Drop duplicate outgoing TMGetLedger messages per peer
  - Allow a retry after 30s in case of peer or network congestion.
  - Addresses RIPD-1870
  - (Changes levelization. That is not desirable, and will need to be fixed.)
- Drop duplicate incoming TMGetLedger messages per peer
  - Allow a retry after 15s in case of peer or network congestion.
  - The requestCookie is ignored when computing the hash, thus increasing
    the chances of detecting duplicate messages.
  - With duplicate messages, keep track of the different requestCookies
    (or lack of cookie). When work is finally done for a given request,
    send the response to all the peers that are waiting on the request,
    sending one message per peer, including all the cookies and
    a "directResponse" flag indicating the data is intended for the
    sender, too.
  - Addresses RIPD-1871
- Drop duplicate incoming TMLedgerData messages
  - Addresses RIPD-1869
- Improve logging related to ledger acquisition
- Class "CanProcess" to keep track of processing of distinct items

---------

Co-authored-by: Valentin Balaschenko <13349202+vlntb@users.noreply.github.com>

@ximinez ximinez requested a review from Bronek February 14, 2025 00:08
- Convert a bunch of old asserts to XRPL_ASSERT
- Change HashRouter::forEachPeer to getPeers, which just returns the
  set
@ximinez ximinez added the Ready to merge *PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required. label Feb 14, 2025
@ximinez ximinez changed the title Reduce peer message traffic for ledger data Reduce duplicate peer traffic for ledger data Feb 14, 2025
@ximinez ximinez merged commit dd5e655 into XRPLF:develop Feb 14, 2025
21 of 22 checks passed
@ximinez ximinez deleted the pr/getledger branch February 14, 2025 23:51
Copy link
Collaborator

@Bronek Bronek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ximinez , this is just to confirm that you have covered all the feedback I had for this PR, and I particularly like how you changed forEachPeer into getPeers and also paid attention to contract names. Nice !

@legleux legleux mentioned this pull request Feb 18, 2025
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Perf Attn Needed Attention needed from RippleX Performance Team Ready to merge *PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants