Update hardware requirements for benchmark machine
#13308
Comments
What is the configuration now?
https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware
@ggwpez I know this, but the benchmark doesn't pass successfully after my machine is configured.
You can find it here #13317
So we're going to run the benches on cloud VMs right now? Hm, I understand that's what most people are running, so in a way it makes sense, although it might make dealing with the benchmarks more painful due to the inherent variability of cloud machines. Two random possibilities I see which could make this better:

1. Just use bare metal anyway. Run the benchmarks on bare metal, and manually calculate a static multiplier(s) which would allow us to translate the results from bare hardware into the cloud machine, and have those as official weights.
2. Use multiple cloud machines in parallel. Basically you'd essentially have the benchmarks which average the results from one machine (so it reduces the variability within that single VM), and a "meta" benchmark like this which sits a level higher and averages results from multiple machines (which reduces the variability across the VMs). We'd also have to make sure that our VMs are going to be put on separate machines. Not sure if/which cloud providers provide such a service, but an easy way to guarantee this would be to just pick the same type of VMs in multiple regions.
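To make the second idea concrete, here is a minimal Rust sketch of both mechanisms; the function names and numbers are hypothetical illustrations, not an existing Substrate API.

```rust
/// Average one extrinsic's per-VM results (each VM is assumed to have
/// already averaged its own repetitions), reducing cross-VM variability.
fn meta_average(per_vm_averages_ps: &[u64]) -> u64 {
    per_vm_averages_ps.iter().sum::<u64>() / per_vm_averages_ps.len() as u64
}

/// Translate a bare-metal timing into a cloud-equivalent weight using a
/// statically calibrated multiplier (measured once on both hardware kinds).
fn translate_bare_metal(time_ps: u64, static_multiplier: f64) -> u64 {
    (time_ps as f64 * static_multiplier) as u64
}

fn main() {
    // The same extrinsic benchmarked on three VMs in different regions.
    let per_vm = [125_000u64, 131_000, 127_500];
    println!("meta-averaged weight: {} ps", meta_average(&per_vm));

    // A bare-metal result scaled to a cloud-equivalent weight with an
    // assumed multiplier of 1.3.
    println!("translated weight: {} ps", translate_bare_metal(98_000, 1.3));
}
```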
About "Use multiple cloud machines in parallel":
About "Just use bare metal anyway":
Overall, my opinion is that using a fully cloud-based system here would be more beneficial than continuing to use the bare metal setup, and that the resources allocated towards maintaining the current setup would be better used in developing a cloud-based solution.
If we can come to reasonable results using the cloud VMs, they should be fine. However, for this we need to test @koute's approach.
TLDR: FRAME-benchmarking could be the reason, and not the hardware. Comparing them cross-runtime is normally a good indication of whether or not they are consistent, even on the same machine. The hardware stats of the VMs are very consistent from my past measurements. That the weights don't follow suit could come from too few repetitions.
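As a toy illustration of the repetition point (not FRAME's actual estimator): with few repetitions, one sample hit by a noisy neighbour can skew a mean-based weight, while a median over more repetitions stays robust.

```rust
/// Median of a set of timing samples; robust against single outliers.
fn median(samples: &mut [u64]) -> u64 {
    samples.sort_unstable();
    samples[samples.len() / 2]
}

fn main() {
    // Nine "quiet" runs plus one run disturbed by a noisy neighbour VM.
    let mut times_ps = vec![100u64, 101, 99, 100, 102, 98, 100, 101, 99, 450];
    let mean = times_ps.iter().sum::<u64>() / times_ps.len() as u64;
    println!("mean:   {} ps (skewed by the outlier)", mean);
    println!("median: {} ps (robust to the outlier)", median(&mut times_ps));
}
```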
We should aim for the most reliable solution with the least noise here. So why not both? Improve FRAME and still run on machines with predictable load (bare metal).
@rcny I talked with @athei and we came to the agreement that we should continue running the bare metal machines until either FRAME-benchmarking is working properly or we come up with some different solution. We cannot sacrifice our work here because maintaining the bare metal machines is more workload. I trust you with this and we will make sure to find a solution we are all happy with, but until this is done please bring back the bare metal machines.
What this means concretely:
It is no problem to have the VMs in parallel to try them out. But we should not switch over until everything works for some time. |
Tracked in https://github.com/paritytech/ci_cd/issues/740. |
The baseline in
To me it feels like we have two somewhat conflicting use-cases here:
For (1):
For (2):
Proposal
So what would make sense to me would be:
Another nice thing about (b) would be that we could also hook other non-FRAME benchmarks to that, e.g. some of the tests I usually manually run when we upgrade
Even though it's relatively easy to implement, I'd first propose an alternative:
Sounds reasonable & for sure we can make both modes work through the command bot.
No, it isn't about the crate
Even if we had those two sets of benchmarking machines: it will still be a problem if the weight machines have a lot of noise. I think we will still have a problem merging those diffs even if the "performance machines" show no diff. Also, it will make the setup more complicated. I think we should first try to improve the benchmarking machinery and see if we can get rid of the noise, as suggested by @ggwpez. Only if this fails should we look into further steps. But until then we need to be unblocked and continue using the old machines.
Yes. I am using a "special" bench bot command, but it should be the default, because anyone who doesn't know about this will be caught off guard, scratching their heads about the diff.
@athei @ggwpez @bkchr @koute What would be the long-term plan re VM vs BM? If it's not yet clear, then I'd suggest using this "special" :) bench-bm command, which runs benchmarks on BM, in the meanwhile, until we have an understanding: VM, VM + BM, or anything else :)
As I understood from the thread, it's possible to adjust the benchmarks in Substrate to run them on VM runners, but it requires additional coding as described in #13308 (comment). So until this is addressed, Substrate is going to use BM runners.
Quoting what @athei said above:
I would propose to do exactly what @athei proposed above your post. The default is to use BM and then have some special command for VM.
The benchmarking works, but produces completely different results all the time, because the VM has a lot of noise. We also already have seen this in Cumulus benchmarks.
Oh, that sucks and should be investigated :( Can you give some examples please?
I've found, for reference:
Yes and that is also what we already have discussed here in the issue. We will investigate and hope that paritytech/polkadot-sdk#379 fixes it. However, until we know the proper fix we want to continue with the current machines.
BM machines and weights were reverted in paritytech/cumulus#2225 and paritytech/polkadot#6762. The bench bot is now using the BM machine by default (reverted in paritytech/command-bot-scripts#14).
Thank you @alvicsam
What bare-metal instance is Substrate using for the benchmarks? I would like to run the benchmarks independently to see if I can get the same results.
Substrate switched to VM servers as well. In theory the Wiki entry should be enough info to reproduce it. If you want to exactly reproduce our CI setup, @oleg-plakida will have to confirm it.
Ok, I'm confused. @alvicsam said here that Polkadot/Cumulus reverted to bare-metal instances, so I was wondering what BM instances you guys are using; I know before it was an OVH Rise 2. So now it is using Google's VMs again?
Hi, Lohann. It has been a rough road to VMs, with some back and forth migrations. But since about a month ago all repos have been switched to VMs. Also, the 'bot bench' command uses VM machines by default. You can refer to Oliver's comment if you need to create a testing env of your own. Please use the how-to: https://github.com/paritytech/ci_cd/wiki/Gitlab:-Setup-new--runner-(bare-metal-and-VM)#create-runner-manually
Yep. Mostly the same, except for some CI parameters which could be omitted. One small note: I would recommend using the freshest machine image:
@oleg-plakida thank you so much for your help, btw this repo is private for me: One thing that is not clear is how you guys solved the issues regarding the VMs' results. Is Substrate implementing @koute's multiple-cloud-machines-in-parallel approach to get an average weight, or does it simply run on one VM?
The benchmark results were consistent enough to not warrant any further action. We just run it once on a VM and then use the results. As far as I am aware, no obvious inconsistencies were reported.
Oliver is right. Regarding access, it's only for the Parity team. You can contact me on Matrix and I will find a way to pass these docs and the custom docker-machine binary to you.
Run the benchmark machine on new reference hardware and update the requirements JSON file. Showed up on SE.
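For context, a rough sketch of what such a requirements check conceptually does: compare freshly measured scores against the minimums recorded in the requirements JSON. The metric names, units, and thresholds below are purely illustrative assumptions, not the actual file format.

```rust
/// One minimum score the machine must reach (illustrative units).
struct Requirement {
    metric: &'static str,
    minimum: f64,
}

/// Pass only if every required metric was measured and meets its minimum.
fn check(measured: &[(&str, f64)], requirements: &[Requirement]) -> bool {
    requirements.iter().all(|req| {
        measured
            .iter()
            .find(|(name, _)| *name == req.metric)
            .map_or(false, |(_, score)| *score >= req.minimum)
    })
}

fn main() {
    // Hypothetical minimums, standing in for the requirements JSON contents.
    let requirements = [
        Requirement { metric: "cpu_hashrate", minimum: 1000.0 },
        Requirement { metric: "memory_bandwidth", minimum: 14_000.0 },
        Requirement { metric: "disk_random_write", minimum: 350.0 },
    ];
    // Hypothetical scores from a fresh run on the new reference hardware.
    let measured = [
        ("cpu_hashrate", 1150.0),
        ("memory_bandwidth", 14_500.0),
        ("disk_random_write", 420.0),
    ];
    println!("machine passes: {}", check(&measured, &requirements));
}
```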