Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Poor RPC performance #5821

Closed
bakhtin opened this issue Jul 27, 2022 · 3 comments
Closed

Poor RPC performance #5821

bakhtin opened this issue Jul 27, 2022 · 3 comments

Comments

@bakhtin
Copy link
Contributor

bakhtin commented Jul 27, 2022

I'm running a Westend RPC node that exhibits a poor performance serving RPC calls (both WS and HTTP).

RPC calls execution time varies greatly on whether it is first or subsequent call. The most notable example being state_getMetadata method that can take anywhere from 8ms to 18s. I recorded two profiles using perf during the execution of state_getMetadata method and rendered a flamegraph:

  1. Slow (first) RPC call
    perf-slow
    Source: https://transfer.sh/e52S3b/perf-slow.svg

  2. Fast (subsequent) RPC call
    perf-fast
    Source: https://transfer.sh/RQXIUs/perf-fast.svg

On the first flamegraph most of the CPU time was spent compiling a WASM code and only 2.39% of the time actually executing it. During the WASM compilation I observe a high CPU utilization (100% per vCPU, depending on the number of in-flight requests).

I plotted a graph for the median distribution of the RPC calls execution time over the past two days (y-axis being microseconds)
image

state_getMetadata method alone is being executed on average 1.23 times per second.
image

Each time a method execution falls into a slow RPC call bucket and requires WASM compilation, a node uses 100% of the vCPU. As there are other methods being called it significantly increases the resources usage and impacts end-user experience. It also leads to a possibility to cause a DoS by utilizing all of the available resources.

The above behavior is observed on both GCP VM instance and inside a container running in a K8s cluster
System info 1 (VM):
Debian 10
Linux kernel: 4.19.0-18-cloud-amd64
Polkadot binary v0.9.26
CLI parameters: polkadot --detailed-log-output --name westend-rpc-1 --unsafe-ws-external --rpc-methods Safe --rpc-cors * --chain westend --listen-addr=/ip4/0.0.0.0/tcp/30333 --public-addr=/ip4/<redacted>/tcp/30333 --in-peers 25 --out-peers 25 --db-cache 512 --telemetry-url wss://<redacted> --pruning=archive -lsync=trace,rpc_metrics=debug --prometheus-external --prometheus-port 9615 --ws-port 9944 --ws-max-connections 5000 --rpc-port 9933

System info 1 (container):
Google Kubernetes engine: v1.22.10-gke.600
Linux kernel 5.10.109+
Polkadot image v0.9.26
CLI parameters: polkadot --name=westend-rpc-node-0 --base-path=/data/ --chain=westend --database=rocksdb --pruning=archive --prometheus-external --prometheus-port 9615 --unsafe-rpc-external --unsafe-ws-external --rpc-methods=safe --rpc-cors=all --ws-max-connections 5000 --telemetry-url wss://<redacted> --db-cache 512 --in-peers 25 --out-peers 25 -lrpc_metrics=debug --listen-addr=/ip4/0.0.0.0/tcp/30333

@bkchr
Copy link
Member

bkchr commented Jul 27, 2022

The issue name is misleading. We need to compile the wasm code to call into it. Not sure what expects to happen.

You can run with --runtime-cache-size=10 or something higher to keep more compiled wasm blobs in the cache.

@bakhtin
Copy link
Contributor Author

bakhtin commented Jul 28, 2022

I'd expect the less variance in response times even in presence of WASM code compilation for every request. Anyways, I'll try adjusting the runtime-cache-size as you suggested and see the results.

@bakhtin bakhtin closed this as completed Jul 28, 2022
@koute
Copy link
Contributor

koute commented Jul 28, 2022

This is expected. If an RPC call requires a call into the runtime then we need to have that runtime compiled and ready for instantiation (otherwise we can't call into it). So we need to compile it, and that takes time.

Although one improvement we could make here is if we'd enable the executor's on-disk cache, which would allow wasmtime to skip compiling a given runtime without having to increase the in-memory runtime cache size as long as it has previously compiled a given runtime at least once before. (So the RPC time would still take a long time the very first time you call it, but any subsequent call for that given runtime, even if the node is restarted, should be relatively fast, since it could then reload the compiled runtime from disk.) We have the code to do that, but IIRC this is currently hardcoded to essentially be always disabled.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants