Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Update hardware requirements for benchmark machine #13308

Closed
ggwpez opened this issue Feb 3, 2023 · 35 comments · Fixed by #13317
Closed

Update hardware requirements for benchmark machine #13308

ggwpez opened this issue Feb 3, 2023 · 35 comments · Fixed by #13317
Labels
U2-some_time_soon Issue is worth doing soon. Z0-trivial Writing the issue is of the same difficulty as patching the code.

Comments

@ggwpez
Copy link
Member

ggwpez commented Feb 3, 2023

Run the benchmark machine on new reference hardware and update the requirements JSON file. Showed up on SE.

@ggwpez ggwpez added U2-some_time_soon Issue is worth doing soon. Z0-trivial Writing the issue is of the same difficulty as patching the code. labels Feb 3, 2023
@99Kies
Copy link

99Kies commented Feb 4, 2023

What is the configuration now?

@ggwpez
Copy link
Member Author

ggwpez commented Feb 4, 2023

@99Kies
Copy link

99Kies commented Feb 4, 2023

@ggwpez I know this, but benchmark doesn't pass successfully after my machine is configured.
image

@ggwpez
Copy link
Member Author

ggwpez commented Feb 6, 2023

You can find it here #13317

@koute
Copy link
Contributor

koute commented Feb 16, 2023

So we're going to run the benches on cloud VMs right now? Hm, I understand that's what most people are running so in a way it makes sense, although it might make dealing with the benchmarks more painful due to the inherent variability of cloud machines.

Two random possibilities I see which could make this better.

Just use bare metal anyway

Run the benchmarks on bare metal, and manually calculate a static multiplier(s) which would allow us to translate the results from bare hardware into the cloud machine, and have those as official weights.

Use multiple cloud machines in parallel

  1. The benchmark bot is triggered to run benches.
  2. It runs the benches not on one machine, but on X machines in parallel (where X is picked high enough to make them "stable"); every run generates a .json with the results
  3. After they all finish the bot takes all of the results and merges them into one big json, and then generates the weights for that. (I've added the ability to generate the weights from the json file without running the benches so this is mostly done; you'd just need to write some simple program to merge the jsons)

Basically you'd essentially have the benchmarks which average the results from one machine (so it reduces the variability within that single VM), and a "meta" benchmark like this which sits a lever higher and averages results from multiple machines (which reduces the variability across the VMs).

We'd also have to make sure that our VMs are going to be put on separate machines. Not sure if/which cloud providers provide such service, but an easy way to guarantee this would be to just pick the same type of VMs in multiple regions.

@paritytech paritytech deleted a comment from moliqingcha000 Feb 16, 2023
@rcny
Copy link
Contributor

rcny commented Feb 16, 2023

About Use multiple cloud machines in parallel:

  • GCP supports VM instance placement policies
  • This could be done for GitLab almost natively and I think it could be applied after some tweaking around to GHA too

About Just use bare metal anyway:

  • IMO sticking with this setup here gives us less benefits than going cloud all-in because of the required maintenance overhead for these bare metal hosts. The engineering resources' investment that should be put into the approach with calculating the median for cloud VMs would benefit us much more in the end.

Overall, my opinion is that using a fully cloud-based system here would be more beneficial than continuing to use the bare metal setup, and that the resources allocated towards maintaining the current setup would be better used in developing a cloud-based solution.

@bkchr
Copy link
Member

bkchr commented Feb 16, 2023

If we can come to reasonable results using the cloud VMs, they should be fine. However, for this we need to test @koute approach.

@ggwpez
Copy link
Member Author

ggwpez commented Feb 16, 2023

TLDR: FRAME-benchmarking could be the reason and not hardware.

Comparing them cross-runtime is normally a good indication whether or not they are consistent, even on the same machine.
Since if they are not; we can be sure that the benchmark is unstable.

The hardware stats of the VMs are very consistent from my past measurements. That the weights dont follow suit could come from too few repetitions.
Especially for short No-OPs like remark_with_event, we probably have to implement a dynamic number of repetitions.
Currently it is doing about 100 reps for that in the standard config, which is a very short time-frame on the VM.
Instead we can first do a phase where the variance is measured and then calculate how many repetitions it would need to "converge". Now if it does not converge, we could emit an error that the hardware is broken.

@athei
Copy link
Member

athei commented Feb 16, 2023

TLDR: FRAME-benchmarking could be the reason and not hardware.

We should aim for the most reliable solution with the least noise here. So why not both? Improve FRAME and still run on machines with predictable load (bare metal).

@bkchr
Copy link
Member

bkchr commented Feb 16, 2023

@rcny I talked with @athei and we came to the agreement that we should continue running the bare metal machines until either FRAME-benchmarking is working properly or we come up with some different solution. We can not sacrifice our work here because maintaining the bare metal machines is more workload. I trust you with this and we will make sure to find a solution we are all happy with, but until this is done please bring back the bare metal machines.

@athei
Copy link
Member

athei commented Feb 16, 2023

What this means concretely:

  1. Revert the baseline back to bare metal
  2. The bench bot and the manual "bench all" job should use bare metal by default

It is no problem to have the VMs in parallel to try them out. But we should not switch over until everything works for some time.

@rcny
Copy link
Contributor

rcny commented Feb 17, 2023

@alvicsam
Copy link
Contributor

The baseline in substrate wasn't changed. There was an attempt in #13316 but it failed because of inconsistent results for pallet-contracts. I will ask @paritytech/opstooling to adjust bench-bot settings to use bare-metal machine for substrate.

@koute
Copy link
Contributor

koute commented Feb 17, 2023

To me it feels like we have two somewhat conflicting use-cases here:

  1. Benchmark to generate weights.
  2. Benchmark to see whether the performance has changed.

For (1):

  • we want a configuration that's equivalent to what people are running in the wild,
  • we want this blessed hardware/software configuration to not change over long periods of time,
  • we want to build a linear model over the raw data (that is, the weights) so that we can roughly predict how much something's going to cost to run

For (2):

  • we don't necessarily need to match what people are running in the wild (if we make something faster on hardware X then it will in the vast majority of cases also be faster on hardware Y, unless we engage in microarchitectural-level optimization, which we usually don't)
  • hardware which is as fast as possible so that the benchmarks finish as fast as possible and we don't spend a long time waiting
  • short-term rock solid run-to-run stability of the results
  • raw data and other metrics which are appropriate for performance comparisons (which weights are not since those can produce relatively huge run-to-run variances even if the performance didn't change purely due to how they're calculated; yes, I know we've been using them like that up until this point, but this is really suboptimal)
  • we want the ability to see whether a specific PR changes the performance, and only that PR (instead of what we do currently where we often compare the freshly generated weights to the weights which already exist in the repository which effectively benchmarks dozens/hundreds of other PRs besides ours and introduces a lot of noise)

Proposal

So what would make sense to me would be:

  1. Further improve FRAME benchmarking machinery with what @ggwpez proposed, and add more metrics besides weights which are more appropriate for performance comparisons (e.g. just essentially copy-paste what criterion's doing)
  2. Have separate sets of machines for use-case (1) and (2). (I don't really think having to maintain a few bare-metal machines for performance benchmarking is a big deal; if no one wants to do it I'll happily volunteer and maintain them for you. :P)
  3. Have the benchmarking bot support two modes:
    a) Weight generation mode. This would generate weights just as it does now, and run on cloud VMs.
    b) Performance benchmarking mode. You'd trigger this bot in a PR and this would run the benchmark twice on a bare metal machine: one time without the changes from the PR, and another time with the changes from the PR, and then it would compare them using appropriate metrics and display a nice automatic analysis (and this wouldn't generate weights, since, again, those are not really appropriate for this use case); kinda like rustc's perf bot

Another nice thing about (b) would be that we could also hook other non-FRAME benchmarks to that, e.g. some of the tests I usually manually run when we upgrade wasmtime I could just have a bot do them automatically.

@mordamax
Copy link
Contributor

mordamax commented Feb 17, 2023

@alvicsam

I will ask https://github.com/orgs/paritytech/teams/opstooling to adjust bench-bot settings to use bare-metal machine for substrate.

Even though it's relatively easy to implement, I'd first propose alternative:
-> if only frame_benchmarking is the problem, can we re-generate ALL weights to VM baseline, and only frame_benchmarking leave on bm3 (using workaround bot bench -v PIPELINE_SCRIPTS_REF=bm-fallback $ pallet dev frame_benchmarking) until it's fixed


@koute

Have the benchmarking bot support two modes

sounds reasonable & for sure we can make both modes work through command bot.
For example:
bot weights -> all pallets on VM + commit new weights
bot bench $ pallet dev frame_benchmarking -> bm + as a result output a link to see a diff

@bkchr
Copy link
Member

bkchr commented Feb 17, 2023

if only frame_benchmarking is the problem, can we re-generate ALL weights to VM baseline

No it isn't about the crate frame-benchmarking. We referred to the runtime benchmarking as "frame-benchmarking".

@athei
Copy link
Member

athei commented Feb 20, 2023

Even if we had those two sets of benchmarking machines: It will still be a problem if the weight machines have a lot of noise. I think we will still have a problem merging those diffs even if the "performance machines" show no diff. Also, it will make the setup more complicated.

I think we should first try improve the benchmarking machinery and see if we can get rid of the noise as suggested by @ggwpez. Only if this fails we should look into further steps. But until then we need to be unblocked and continue using the old machines.

@ggwpez
Copy link
Member Author

ggwpez commented Feb 20, 2023

But until then we need to be unblocked and continue using the old machines

AFAIK you are unblocked and can continue to use the old machines as described above by @mordamax. The Substrate weights are not yet updated @athei

@athei
Copy link
Member

athei commented Feb 20, 2023

Yes. I am using a "special" bench bot command. But it should be the default. Because anyone who doesn't know about this will be caught off guard scratching their heads about the diff.

@mordamax
Copy link
Contributor

mordamax commented Feb 21, 2023

Yes. I am using a "special" bench bot command. But it should be the default. Because anyone who doesn't know about this will be caught off guard scratching their heads about the diff.

@athei @ggwpez @bkchr @koute What would be the long-term plan re VM vs BM? if it's not yet clear, than I'd suggest to use this "special" :) bench-bm command, which runs benchmarks on BM in a meanwhile, until we have an understanding VM, VM + BM or anything else :)
once it's more or less clear, i'll work on more "default"/standard way to deal with it so it's also clear for users and there's no confusion around this

@alvicsam
Copy link
Contributor

As I understood from the thread it's possible to adjust benchmarks in substrate to run them in VM runners but it requires additional coding described in #13308 (comment). So until this is addressed substrate is going to use BM runners.
FYI Cumulus and Polkadot are already using new VM runners without issues.

@bkchr
Copy link
Member

bkchr commented Feb 22, 2023

@athei @ggwpez @bkchr @koute What would be the long-term plan re VM vs BM?

Quoting what @athei said above:

I think we should first try improve the benchmarking machinery and see if we can get rid of the noise as suggested by @ggwpez. Only if this fails we should look into further steps. But until then we need to be unblocked and continue using the old machines.

if it's not yet clear, than I'd suggest to use this "special" :) bench-bm command, which runs benchmarks on BM in a meanwhile, until we have an understanding VM, VM + BM or anything else :)

I would propose to do exactly what @athei proposed above your post. The default is to use BM and then have some special command for VM.

FYI Cumulus and Polkadot are already using new VM runners without issues.

The benchmarking works, but produces completely different results all the time, because the VM has a lot of noise. We also already have seen this in Cumulus benchmarks.

@alvicsam
Copy link
Contributor

The benchmarking works, but produces completely different results all the time, because the VM has a lot of noise. We also already have seen this in Cumulus benchmarks.

Oh, that sucks and should be investigated :( Can you give some examples please?

@alvicsam
Copy link
Contributor

I've found, for the reference:

paritytech/cumulus#2199
paritytech/cumulus#2198

@bkchr
Copy link
Member

bkchr commented Feb 22, 2023

Oh, that sucks and should be investigated :(

Yes and that is also what we already have discussed here in the issue. We will investigate and hope that paritytech/polkadot-sdk#379 fixes it. However, until we know the proper fix we want to continue with the current machines.

@alvicsam
Copy link
Contributor

BM machines and weights were reverted in paritytech/cumulus#2225 and paritytech/polkadot#6762. Bench bot is now using bm machine by default (reverted in paritytech/command-bot-scripts#14)

@bkchr
Copy link
Member

bkchr commented Feb 23, 2023

Thank you @alvicsam

@Lohann
Copy link
Contributor

Lohann commented Aug 8, 2023

What bare-metal instance are substrate using for the benchmarks? I would like to run the benchmarks independently to see if I can get the same results.

@ggwpez
Copy link
Member Author

ggwpez commented Aug 8, 2023

Substrate switched to VM servers as well. In theory the Wiki entry should be enough info to reproduce it.

If you want to exactly reproduce our CI setup, @oleg-plakida will have confirm it.
Personally I am using these args for a gcloud instance. Is that exactly the same as the bot, or what is the bot using @oleg-plakida?

--google-threads-per-core=1 \
--google-machine-type=n2-standard-8 \
--google-min-cpu-platform="Intel Ice Lake" \
--google-zone=europe-west1-b \
--google-machine-image="debian-11-bullseye" \
--google-disk-size=30 \
--google-disk-type='pd-ssd' \
--google-local-ssd=4 \
--engine-opt "data-root=/mnt/local/docker" \
...

@Lohann
Copy link
Contributor

Lohann commented Aug 8, 2023

Ok I'm confusing, @alvicsam said here that polkadot/cumulus reverted to bare-metal instances, so I was wondering what bm instances you guys are using, I know before it was Open OVH Rise 2.

BM machines and weights were reverted in paritytech/cumulus#2225 and paritytech/polkadot#6762. Bench bot is now using bm machine by default (reverted in paritytech/command-bot-scripts#14)

So now it is using Google's VM's again?

@mateo-moon
Copy link
Contributor

Ok I'm confusing, @alvicsam said here that polkadot/cumulus reverted to bare-metal instances, so I was wondering what bm instances you guys are using, I know before it was Open OVH Rise 2.

BM machines and weights were reverted in paritytech/cumulus#2225 and paritytech/polkadot#6762. Bench bot is now using bm machine by default (reverted in paritytech/command-bot-scripts#14)

So now it is using Google's VM's again?

Hi, Lohann. There have been a rough road to VM with some back and force migrations) But since about month ago all repos had been switched to VM. Also ‘bot bench’ command uses VM machines by default. You can refer to Oliver’s comment if you need to create testing env of your own. Please use the how-to https://github.com/paritytech/ci_cd/wiki/Gitlab:-Setup-new--runner-(bare-metal-and-VM)#create-runner-manually

@mateo-moon
Copy link
Contributor

mateo-moon commented Aug 8, 2023

Substrate switched to VM servers as well. In theory the Wiki entry should be enough info to reproduce it.

If you want to exactly reproduce our CI setup, @oleg-plakida will have confirm it. Personally I am using these args for a gcloud instance. Is that exactly the same as the bot, or what is the bot using @oleg-plakida?

--google-threads-per-core=1 \
--google-machine-type=n2-standard-8 \
--google-min-cpu-platform="Intel Ice Lake" \
--google-zone=europe-west1-b \
--google-machine-image="debian-11-bullseye" \
--google-disk-size=30 \
--google-disk-type='pd-ssd' \
--google-local-ssd=4 \
--engine-opt "data-root=/mnt/local/docker" \
...

Yap. Mostly the same, except of some CI parameters which could be omitted. One small notice, i would recommend use the most fresh machine image:
--google-machine-image="parity-build/global/images/gitlab-runner-debian-11-bullseye-v20221206-v8" but it should be optional.

@Lohann
Copy link
Contributor

Lohann commented Aug 10, 2023

@oleg-plakida thank you so much for your help, btw this repo is private for me:
https://github.com/paritytech/ci_cd/wiki/Gitlab:-Setup-new--runner-(bare-metal-and-VM)#create-runner-manually

One thing is not clear is how you guys solved the issues regarding VM's results, are substrate implementing the @koute multiple cloud machines in parallel for get an average weight? or does it simply runs in one VM?

@ggwpez
Copy link
Member Author

ggwpez commented Aug 10, 2023

The benchmark results were consistent enough to not induce any further action. We just run it once on a VM and then use the results. As far as I am aware, no obvious inconsistencies reported.

@mateo-moon
Copy link
Contributor

@oleg-plakida thank you so much for your help, btw this repo is private for me: https://github.com/paritytech/ci_cd/wiki/Gitlab:-Setup-new--runner-(bare-metal-and-VM)#create-runner-manually

One thing is not clear is how you guys solved the issues regarding VM's results, are substrate implementing the @koute multiple cloud machines in parallel for get an average weight? or does it simply runs in one VM?

Oliver is right. Regarding access it’s only for parity team. You can contact me in matrix and i will find the way to pass this docs and custom binary of docker-machine to you.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
U2-some_time_soon Issue is worth doing soon. Z0-trivial Writing the issue is of the same difficulty as patching the code.
Projects
Development

Successfully merging a pull request may close this issue.

10 participants