feat: add a metric which counts how many chunks couldn't fit all transactions from the pool #10422

jancionear · 2024-01-13T14:24:52Z

When a chunk is produced, Client calls prepare_transactions(), which fetches transactions from the transaction pool and adds them to the chunk. prepare_transactions() adds transactions until some limit is hit, a limit on gas, time, number of transactions, chunk size, etc.

Currently there's no way to know if some limit was hit and which limit was it. Let's add a way to access this information, it's useful to know what's the bottleneck of chunk production.

This PR adds a new metric which counts how many of the produced chunks couldn't fit all transactions from the transaction pool due to hitting a limit. It has two labels: shard_id - chunk's shard and limited_by - which of the limits was hit in prepare_transactions().

The need for this metric was discussed in #10310 (comment). The hope is that it'll allow us to figure out what's the bottleneck of chunk production in scenarios of high chain congestion.
Right now in cases where produce_chunk takes a lot of time we can only make theories why that's the case, but there's no observability into what's going inside. This metric could help with that.

To test that the metric works I set up a mocknet network with a build from this branch and put a load on it using locust.
I reduced produce_chunk_add_transactions_time_limit from 200ms to 2ms to easily trigger the metric. It worked, as can be observed on the grafana dashboard: https://nearinc.grafana.net/d/eada7f01-b2dc-4df8-8a9f-ec4ec411159e/jancio-mocknet-stats?orgId=1&from=1705135200000&to=1705135800000

When a chunk is produced, Client calls `prepare_transactions()`, which fetches transactions from the pool and adds them to the chunk. `prepare_transactions()` adds transactions until some limit is hit, a limit on gas, time, number of transactions, chunk size, etc. Currently `prepare_transacations()` doesn't report which limit was hit, it just returns a list of processed transactions. Let's add a way to access this information, it's useful to know what's the bottleneck of chunk production. The return type of `prepare_transactions()` is changed to a struct which contains the list of prepared transactions and the limit which was hit. This allows the caller to find out what limited the produced chunk. In the future the information about hitting a limit can be exposed in the metrics.

…dn't fit all transactions Add a metric which counts the number produced chunks where some transactions from the transaciton pool didn't fit into the chunk due to a limit. The metric has a label `limited_by`, which lets us know which limit was hit when producing the chunk. This will help to find out what's the bottleneck of chunk production in scenarios of high chain congestion.

jancionear · 2024-01-13T14:30:50Z

The name of the metric is kinda awkward: near_produced_chunks_some_pool_transactions_didnt_fit.

I spent some time thinking of a better one, but in the end I settled on this one. The thing I like about it is that reading this metric name is enough to understand exactly what it means. Adjectives like limited, congested, overflowed can mean a lot of things, but here the situation is clearly stated - not all pool transactions made it into the chunk.

Naming is hard. I'm open for suggestions, here are some AI-generated alternatives (IMO all of them worse):

chunk_produced_congested_total
produced_chunks_full_of_transactions_total
produced_chunks_with_full_transactions
produced_chunks_transactions_full_total
more_tx_than_processable
produced_chunks_some_pool_transactions_didnt_fit
produced_chunks_transactions_overflowed
chunk_overflow_total
chunk_capacity_exceeded_total
chunks_with_unfit_tx_total
chunk_production_overflow_total
chunk_unfilled_transactions_total
chunk_capacity_missed_total
chunk_produced_unfit_tx_total
excess_tx_chunk_count
overflowed_chunk_production_total
unaccommodated_tx_chunks_total

codecov · 2024-01-14T22:25:08Z

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (c536f46) 71.91% compared to head (a05ecc3) 72.02%.
Report is 3 commits behind head on master.

Files	Patch %	Lines
nearcore/src/runtime/mod.rs	86.95%	3 Missing ⚠️
chain/chain/src/types.rs	0.00%	1 Missing and 1 partial ⚠️
chain/client/src/client.rs	96.15%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10422      +/-   ##
==========================================
+ Coverage   71.91%   72.02%   +0.11%     
==========================================
  Files         720      720              
  Lines      145051   144902     -149     
  Branches   145051   144902     -149     
==========================================
+ Hits       104308   104362      +54     
+ Misses      35954    35758     -196     
+ Partials     4789     4782       -7

Flag	Coverage Δ
backward-compatibility	`0.08% <0.00%> (+<0.01%)`	⬆️
db-migration	`0.08% <0.00%> (+<0.01%)`	⬆️
genesis-check	`1.26% <0.00%> (+<0.01%)`	⬆️
integration-tests	`36.82% <92.04%> (+0.12%)`	⬆️
linux	`71.55% <91.25%> (+0.11%)`	⬆️
linux-nightly	`71.58% <89.77%> (+0.11%)`	⬆️
macos	`53.76% <65.00%> (-0.99%)`	⬇️
pytests	`1.48% <0.00%> (+<0.01%)`	⬆️
sanity-checks	`1.27% <0.00%> (+<0.01%)`	⬆️
unittests	`68.11% <68.18%> (+<0.01%)`	⬆️
upgradability	`0.13% <0.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

wacban

LGTM, thanks!

There is some unrelated file attached to this PR at the end please remove it
cc @staffik or @nagisa Do you happen to know where that wallet_contract.was file is coming from?

hint for grafana - in case you want to use that dashboard - you can set nice labels by setting the label to custom and formatting only fields that you are interested in e.g. {{node_id}}

wacban · 2024-01-15T10:28:31Z

chain/chain/src/types.rs

@@ -311,6 +311,22 @@ pub struct ApplyChunkShardContext<'a> {
    pub is_first_block_with_chunk_of_version: bool,
 }

+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct PreparedTransactions {


Can you add a small comment please?

nit: typically we use tripple / for struct and method comments

/// Contains transactions ...

Right, fixed 👍

wacban · 2024-01-15T10:28:54Z

chain/chain/src/types.rs

+}
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq, strum::AsRefStr)]
+pub enum PrepareTransactionsLimit {


wacban · 2024-01-15T10:32:42Z

chain/client/src/client.rs

        let Self { chain, sharded_tx_pool, epoch_manager, runtime_adapter: runtime, .. } = self;

        let shard_id = shard_uid.shard_id as ShardId;
        let next_epoch_id = epoch_manager.get_epoch_id_from_prev_block(prev_block_header.hash())?;
        let protocol_version = epoch_manager.get_epoch_protocol_version(&next_epoch_id)?;

-        let transactions = if let Some(mut iter) = sharded_tx_pool.get_pool_iterator(shard_uid) {
+        let prepared = if let Some(mut iter) = sharded_tx_pool.get_pool_iterator(shard_uid) {


nit: prepared_transactions to keep it consistent or just transactions to indicate it's going to be the transactins field of PreparedTransactions

nearcore/src/runtime/mod.rs

wacban · 2024-01-15T10:35:56Z

nearcore/src/runtime/mod.rs

+                result.limited_by = Some(PrepareTransactionsLimit::ReceiptCount);
+                break;
+            }
+            if let Some(tlimit) = &time_limit {


nit: if let Some(time_limit) = &time_limit would be more rusty way to do it

My instinct is to avoid shadowing variables at all cost, it's easier on the brain when one name corresponds to one object. And I guess I had many bad experiences with bugs caused by shadowing things.
But okay, I'll rename tlimit totime_limit if that's more rusty.

Generally speaking I agree with you, in this particular case, one could argue it's still the same thing, the only difference is the type and that is checked by the compiler.

wacban · 2024-01-15T10:40:00Z

also it takes a lot of willpower than I'm willing to admit to not look for a better name ;)

jancionear · 2024-01-15T10:45:50Z

There is some unrelated file attached to this PR at the end please remove it
cc @staffik or @nagisa Do you happen to know where that wallet_contract.was file is coming from?

oops, previously I ran locust loadtests on this machine (ft.py, congestion.py), so maybe that's where this came from.
Will remove, I didn't notice it in the diff.

In a previous commit I remove wallet_contract.wasm thinking that it shouldn't be tracked by git, but it turns out that it should. Let's take wallet_contract.wasm from master and add it to the git. The effect will be that there will be no change to wallet_contract.wasm

jancionear · 2024-01-15T11:53:07Z

v2:

removed extraneous file
Added comments on the new structs
Renamed variables to improve clarity

wacban

thanks!

wacban · 2024-01-15T16:31:38Z

chain/chain/src/types.rs

@@ -311,6 +311,22 @@ pub struct ApplyChunkShardContext<'a> {
    pub is_first_block_with_chunk_of_version: bool,
 }

+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct PreparedTransactions {


nit: typically we use tripple / for struct and method comments

/// Contains transactions ...

nagisa · 2024-01-15T17:03:09Z

Do you happen to know where that wallet_contract.was file is coming from?

I don’t. I have noticed that running the full test suite will generate those files for me as well though. If we don't expect it to be checked in, we should put it in target/… I suspect there's a build script somewhere that compiles into a working directory somewhere… I’ll look.

staffik · 2024-01-15T21:42:48Z

Do you happen to know where that wallet_contract.was file is coming from?

I don’t. I have noticed that running the full test suite will generate those files for me as well though. If we don't expect it to be checked in, we should put it in target/… I suspect there's a build script somewhere that compiles into a working directory somewhere… I’ll look.

@nagisa It is generated by runtime/near-wallet-contract/build.rs. There will be a follow up PR where it will be generated differently: #10269 (comment).
But for now, I will create a tiny PR tomorrow that stops auto-generating this file.

In some circumstances, `runtime/near-wallet-contract/build.rs` is triggered and it regenerates the Wallet Contract WASM file for unrelated PRs: #10422 (comment). The `Wallet Contract` will be generated differently (#10269 (comment)), but for now I would like to include this tiny PR to stops auto-generating the WASM file by default. To test that the WASM file is not auto-regenerated, add a space to `runtime/near-wallet-contract/wallet-contract/src/lib.rs` and run `cargo build --features nightly`.

jancionear · 2024-01-17T13:17:09Z

The code coverage check didn't pass because there are no tests for the new metric.
It would be hard to test this metric - I would have to write an integration test which generates a heavy load that triggers congestion control and increases the metric. I tested the metric manually on a mocknet setup, it should be enough, merging.

jancionear added 2 commits January 13, 2024 06:30

jancionear requested a review from wacban January 13, 2024 14:24

jancionear requested a review from a team as a code owner January 13, 2024 14:24

fix compilation with test_features feature enabled

137c740

wacban approved these changes Jan 15, 2024

View reviewed changes

jancionear added 5 commits January 15, 2024 10:51

remove wallet_contract.wasm, was added by mistake

dfcf9cb

Add comments to the new structs

be494fa

Rename prepared to prepared_transactions to keep things consistent

524019b

Rename tlimit to time_limit

280efae

jancionear requested a review from wacban January 15, 2024 11:53

wacban approved these changes Jan 15, 2024

View reviewed changes

Use three slashes for comments on structs

9da69fd

staffik mentioned this pull request Jan 16, 2024

Disable auto-generation of wallet contract by default #10436

Merged

jancionear added this pull request to the merge queue Jan 17, 2024

Merged via the queue into near:master with commit 790d663 Jan 17, 2024
14 of 15 checks passed

jancionear deleted the congested-chunk-metric branch January 17, 2024 13:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add a metric which counts how many chunks couldn't fit all transactions from the pool #10422

feat: add a metric which counts how many chunks couldn't fit all transactions from the pool #10422

jancionear commented Jan 13, 2024

jancionear commented Jan 13, 2024

codecov bot commented Jan 14, 2024 •

edited

Loading

wacban left a comment

wacban Jan 15, 2024

wacban Jan 15, 2024

jancionear Jan 15, 2024

wacban Jan 15, 2024

wacban Jan 15, 2024

wacban Jan 15, 2024

jancionear Jan 15, 2024

wacban Jan 15, 2024

wacban commented Jan 15, 2024

jancionear commented Jan 15, 2024

jancionear commented Jan 15, 2024

wacban left a comment

wacban Jan 15, 2024

nagisa commented Jan 15, 2024

staffik commented Jan 15, 2024

jancionear commented Jan 17, 2024

feat: add a metric which counts how many chunks couldn't fit all transactions from the pool #10422

feat: add a metric which counts how many chunks couldn't fit all transactions from the pool #10422

Conversation

jancionear commented Jan 13, 2024

jancionear commented Jan 13, 2024

codecov bot commented Jan 14, 2024 • edited Loading

Codecov Report

wacban left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wacban commented Jan 15, 2024

jancionear commented Jan 15, 2024

jancionear commented Jan 15, 2024

wacban left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nagisa commented Jan 15, 2024

staffik commented Jan 15, 2024

jancionear commented Jan 17, 2024

codecov bot commented Jan 14, 2024 •

edited

Loading