Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add a metric which counts how many chunks couldn't fit all transactions from the pool #10422

Merged
merged 9 commits into from
Jan 17, 2024

Conversation

jancionear
Copy link
Contributor

When a chunk is produced, Client calls prepare_transactions(), which fetches transactions from the transaction pool and adds them to the chunk. prepare_transactions() adds transactions until some limit is hit, a limit on gas, time, number of transactions, chunk size, etc.

Currently there's no way to know if some limit was hit and which limit was it. Let's add a way to access this information, it's useful to know what's the bottleneck of chunk production.

This PR adds a new metric which counts how many of the produced chunks couldn't fit all transactions from the transaction pool due to hitting a limit. It has two labels: shard_id - chunk's shard and limited_by - which of the limits was hit in prepare_transactions().

The need for this metric was discussed in #10310 (comment). The hope is that it'll allow us to figure out what's the bottleneck of chunk production in scenarios of high chain congestion.
Right now in cases where produce_chunk takes a lot of time we can only make theories why that's the case, but there's no observability into what's going inside. This metric could help with that.

To test that the metric works I set up a mocknet network with a build from this branch and put a load on it using locust.
I reduced produce_chunk_add_transactions_time_limit from 200ms to 2ms to easily trigger the metric. It worked, as can be observed on the grafana dashboard: https://nearinc.grafana.net/d/eada7f01-b2dc-4df8-8a9f-ec4ec411159e/jancio-mocknet-stats?orgId=1&from=1705135200000&to=1705135800000
Screenshot 2024-01-13 at 15 21 12

When a chunk is produced, Client calls `prepare_transactions()`,
which fetches transactions from the pool and adds them to the chunk.

`prepare_transactions()` adds transactions until some limit is hit,
a limit on gas, time, number of transactions, chunk size, etc.
Currently `prepare_transacations()` doesn't report which limit was hit,
it just returns a list of processed transactions. Let's add a way to access
this information, it's useful to know what's the bottleneck of chunk production.

The return type of `prepare_transactions()` is changed to a struct which
contains the list of prepared transactions and the limit which was hit.
This allows the caller to find out what limited the produced chunk.
In the future the information about hitting a limit can be exposed in the metrics.
…dn't fit all transactions

Add a metric which counts the number produced chunks where some transactions
from the transaciton pool didn't fit into the chunk due to a limit.
The metric has a label `limited_by`, which lets us know which limit was hit when producing the chunk.
This will help to find out what's the bottleneck of chunk production in scenarios of high chain congestion.
@jancionear jancionear requested a review from a team as a code owner January 13, 2024 14:24
@jancionear
Copy link
Contributor Author

The name of the metric is kinda awkward: near_produced_chunks_some_pool_transactions_didnt_fit.

I spent some time thinking of a better one, but in the end I settled on this one. The thing I like about it is that reading this metric name is enough to understand exactly what it means. Adjectives like limited, congested, overflowed can mean a lot of things, but here the situation is clearly stated - not all pool transactions made it into the chunk.

Naming is hard. I'm open for suggestions, here are some AI-generated alternatives (IMO all of them worse):

chunk_produced_congested_total
produced_chunks_full_of_transactions_total
produced_chunks_with_full_transactions
produced_chunks_transactions_full_total
more_tx_than_processable
produced_chunks_some_pool_transactions_didnt_fit
produced_chunks_transactions_overflowed
chunk_overflow_total
chunk_capacity_exceeded_total
chunks_with_unfit_tx_total
chunk_production_overflow_total
chunk_unfilled_transactions_total
chunk_capacity_missed_total
chunk_produced_unfit_tx_total
excess_tx_chunk_count
overflowed_chunk_production_total
unaccommodated_tx_chunks_total

Copy link

codecov bot commented Jan 14, 2024

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (c536f46) 71.91% compared to head (a05ecc3) 72.02%.
Report is 3 commits behind head on master.

Files Patch % Lines
nearcore/src/runtime/mod.rs 86.95% 3 Missing ⚠️
chain/chain/src/types.rs 0.00% 1 Missing and 1 partial ⚠️
chain/client/src/client.rs 96.15% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10422      +/-   ##
==========================================
+ Coverage   71.91%   72.02%   +0.11%     
==========================================
  Files         720      720              
  Lines      145051   144902     -149     
  Branches   145051   144902     -149     
==========================================
+ Hits       104308   104362      +54     
+ Misses      35954    35758     -196     
+ Partials     4789     4782       -7     
Flag Coverage Δ
backward-compatibility 0.08% <0.00%> (+<0.01%) ⬆️
db-migration 0.08% <0.00%> (+<0.01%) ⬆️
genesis-check 1.26% <0.00%> (+<0.01%) ⬆️
integration-tests 36.82% <92.04%> (+0.12%) ⬆️
linux 71.55% <91.25%> (+0.11%) ⬆️
linux-nightly 71.58% <89.77%> (+0.11%) ⬆️
macos 53.76% <65.00%> (-0.99%) ⬇️
pytests 1.48% <0.00%> (+<0.01%) ⬆️
sanity-checks 1.27% <0.00%> (+<0.01%) ⬆️
unittests 68.11% <68.18%> (+<0.01%) ⬆️
upgradability 0.13% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wacban wacban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

There is some unrelated file attached to this PR at the end please remove it
cc @staffik or @nagisa Do you happen to know where that wallet_contract.was file is coming from?

hint for grafana - in case you want to use that dashboard - you can set nice labels by setting the label to custom and formatting only fields that you are interested in e.g. {{node_id}}

@@ -311,6 +311,22 @@ pub struct ApplyChunkShardContext<'a> {
pub is_first_block_with_chunk_of_version: bool,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PreparedTransactions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a small comment please?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typically we use tripple / for struct and method comments

/// Contains transactions ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, fixed 👍

}

#[derive(Debug, Clone, Copy, PartialEq, Eq, strum::AsRefStr)]
pub enum PrepareTransactionsLimit {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

let Self { chain, sharded_tx_pool, epoch_manager, runtime_adapter: runtime, .. } = self;

let shard_id = shard_uid.shard_id as ShardId;
let next_epoch_id = epoch_manager.get_epoch_id_from_prev_block(prev_block_header.hash())?;
let protocol_version = epoch_manager.get_epoch_protocol_version(&next_epoch_id)?;

let transactions = if let Some(mut iter) = sharded_tx_pool.get_pool_iterator(shard_uid) {
let prepared = if let Some(mut iter) = sharded_tx_pool.get_pool_iterator(shard_uid) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prepared_transactions to keep it consistent or just transactions to indicate it's going to be the transactins field of PreparedTransactions

nearcore/src/runtime/mod.rs Show resolved Hide resolved
result.limited_by = Some(PrepareTransactionsLimit::ReceiptCount);
break;
}
if let Some(tlimit) = &time_limit {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if let Some(time_limit) = &time_limit would be more rusty way to do it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My instinct is to avoid shadowing variables at all cost, it's easier on the brain when one name corresponds to one object. And I guess I had many bad experiences with bugs caused by shadowing things.
But okay, I'll rename tlimit totime_limit if that's more rusty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking I agree with you, in this particular case, one could argue it's still the same thing, the only difference is the type and that is checked by the compiler.

@wacban
Copy link
Contributor

wacban commented Jan 15, 2024

also it takes a lot of willpower than I'm willing to admit to not look for a better name ;)

@jancionear
Copy link
Contributor Author

There is some unrelated file attached to this PR at the end please remove it
cc @staffik or @nagisa Do you happen to know where that wallet_contract.was file is coming from?

oops, previously I ran locust loadtests on this machine (ft.py, congestion.py), so maybe that's where this came from.
Will remove, I didn't notice it in the diff.

In a previous commit I remove wallet_contract.wasm
thinking that it shouldn't be tracked by git, but
it turns out that it should. Let's take wallet_contract.wasm
from master and add it to the git. The effect will be that
there will be no change to wallet_contract.wasm
@jancionear
Copy link
Contributor Author

v2:

  • removed extraneous file
  • Added comments on the new structs
  • Renamed variables to improve clarity

Copy link
Contributor

@wacban wacban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@@ -311,6 +311,22 @@ pub struct ApplyChunkShardContext<'a> {
pub is_first_block_with_chunk_of_version: bool,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PreparedTransactions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typically we use tripple / for struct and method comments

/// Contains transactions ...

@nagisa
Copy link
Collaborator

nagisa commented Jan 15, 2024

Do you happen to know where that wallet_contract.was file is coming from?

I don’t. I have noticed that running the full test suite will generate those files for me as well though. If we don't expect it to be checked in, we should put it in target/… I suspect there's a build script somewhere that compiles into a working directory somewhere… I’ll look.

@staffik
Copy link
Contributor

staffik commented Jan 15, 2024

Do you happen to know where that wallet_contract.was file is coming from?

I don’t. I have noticed that running the full test suite will generate those files for me as well though. If we don't expect it to be checked in, we should put it in target/… I suspect there's a build script somewhere that compiles into a working directory somewhere… I’ll look.

@nagisa It is generated by runtime/near-wallet-contract/build.rs. There will be a follow up PR where it will be generated differently: #10269 (comment).
But for now, I will create a tiny PR tomorrow that stops auto-generating this file.

github-merge-queue bot pushed a commit that referenced this pull request Jan 16, 2024
In some circumstances, `runtime/near-wallet-contract/build.rs` is
triggered and it regenerates the Wallet Contract WASM file for unrelated
PRs:
#10422 (comment).
The `Wallet Contract` will be generated differently
(#10269 (comment)),
but for now I would like to include this tiny PR to stops
auto-generating the WASM file by default.

To test that the WASM file is not auto-regenerated, add a space to
`runtime/near-wallet-contract/wallet-contract/src/lib.rs` and run `cargo
build --features nightly`.
@jancionear
Copy link
Contributor Author

The code coverage check didn't pass because there are no tests for the new metric.
It would be hard to test this metric - I would have to write an integration test which generates a heavy load that triggers congestion control and increases the metric. I tested the metric manually on a mocknet setup, it should be enough, merging.

@jancionear jancionear added this pull request to the merge queue Jan 17, 2024
Merged via the queue into near:master with commit 790d663 Jan 17, 2024
14 of 15 checks passed
@jancionear jancionear deleted the congested-chunk-metric branch January 17, 2024 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants