Poor RPC performance #5821

bakhtin · 2022-07-27T12:20:31Z

I'm running a Westend RPC node that exhibits a poor performance serving RPC calls (both WS and HTTP).

RPC calls execution time varies greatly on whether it is first or subsequent call. The most notable example being state_getMetadata method that can take anywhere from 8ms to 18s. I recorded two profiles using perf during the execution of state_getMetadata method and rendered a flamegraph:

Slow (first) RPC call

Source: https://transfer.sh/e52S3b/perf-slow.svg
Fast (subsequent) RPC call

Source: https://transfer.sh/RQXIUs/perf-fast.svg

On the first flamegraph most of the CPU time was spent compiling a WASM code and only 2.39% of the time actually executing it. During the WASM compilation I observe a high CPU utilization (100% per vCPU, depending on the number of in-flight requests).

I plotted a graph for the median distribution of the RPC calls execution time over the past two days (y-axis being microseconds)

state_getMetadata method alone is being executed on average 1.23 times per second.

Each time a method execution falls into a slow RPC call bucket and requires WASM compilation, a node uses 100% of the vCPU. As there are other methods being called it significantly increases the resources usage and impacts end-user experience. It also leads to a possibility to cause a DoS by utilizing all of the available resources.

The above behavior is observed on both GCP VM instance and inside a container running in a K8s cluster
System info 1 (VM):
Debian 10
Linux kernel: 4.19.0-18-cloud-amd64
Polkadot binary v0.9.26
CLI parameters: polkadot --detailed-log-output --name westend-rpc-1 --unsafe-ws-external --rpc-methods Safe --rpc-cors * --chain westend --listen-addr=/ip4/0.0.0.0/tcp/30333 --public-addr=/ip4/<redacted>/tcp/30333 --in-peers 25 --out-peers 25 --db-cache 512 --telemetry-url wss://<redacted> --pruning=archive -lsync=trace,rpc_metrics=debug --prometheus-external --prometheus-port 9615 --ws-port 9944 --ws-max-connections 5000 --rpc-port 9933

System info 1 (container):
Google Kubernetes engine: v1.22.10-gke.600
Linux kernel 5.10.109+
Polkadot image v0.9.26
CLI parameters: polkadot --name=westend-rpc-node-0 --base-path=/data/ --chain=westend --database=rocksdb --pruning=archive --prometheus-external --prometheus-port 9615 --unsafe-rpc-external --unsafe-ws-external --rpc-methods=safe --rpc-cors=all --ws-max-connections 5000 --telemetry-url wss://<redacted> --db-cache 512 --in-peers 25 --out-peers 25 -lrpc_metrics=debug --listen-addr=/ip4/0.0.0.0/tcp/30333

The text was updated successfully, but these errors were encountered:

bkchr · 2022-07-27T13:39:47Z

The issue name is misleading. We need to compile the wasm code to call into it. Not sure what expects to happen.

You can run with --runtime-cache-size=10 or something higher to keep more compiled wasm blobs in the cache.

bakhtin · 2022-07-28T09:24:44Z

I'd expect the less variance in response times even in presence of WASM code compilation for every request. Anyways, I'll try adjusting the runtime-cache-size as you suggested and see the results.

koute · 2022-07-28T09:40:28Z

This is expected. If an RPC call requires a call into the runtime then we need to have that runtime compiled and ready for instantiation (otherwise we can't call into it). So we need to compile it, and that takes time.

Although one improvement we could make here is if we'd enable the executor's on-disk cache, which would allow wasmtime to skip compiling a given runtime without having to increase the in-memory runtime cache size as long as it has previously compiled a given runtime at least once before. (So the RPC time would still take a long time the very first time you call it, but any subsequent call for that given runtime, even if the node is restarted, should be relatively fast, since it could then reload the compiled runtime from disk.) We have the code to do that, but IIRC this is currently hardcoded to essentially be always disabled.

bakhtin closed this as completed Jul 28, 2022

This was referenced Sep 21, 2022

Bump wasmtime to 1.0.0 paritytech/substrate#12317

Merged

Investigate Cranelift's new incremental compilation cache paritytech/polkadot-sdk#33

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Poor RPC performance #5821

Poor RPC performance #5821

bakhtin commented Jul 27, 2022 •

edited

Loading

bkchr commented Jul 27, 2022

bakhtin commented Jul 28, 2022

koute commented Jul 28, 2022

Poor RPC performance #5821

Poor RPC performance #5821

Comments

bakhtin commented Jul 27, 2022 • edited Loading

bkchr commented Jul 27, 2022

bakhtin commented Jul 28, 2022

koute commented Jul 28, 2022

bakhtin commented Jul 27, 2022 •

edited

Loading