remote_cache/digest: add benchmark for sha256-simd #4547

sluongng · 2023-08-14T15:39:11Z

Provide a setup to compare minio/sha256-simd (Apache 2.0 license)
performance vs the Go standard library "crypto/sha256".

The sha256-simd library comes with 2 modes:

- without server: CPU features are detected automatically
- with server: requires AVX512 CPU features

The ARM64 support is not tested.

Running the benchmark against our remote executor yields:

==================== Test output for //server/remote_cache/digest:simd_bench_test:
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 3.10GHz

BenchmarkSIMDDigestCompute/without_SIMD/1-30                      255240              5042 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/1-30               234526              5190 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/1-30                388          36804140 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/10-30                      10000            118668 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/10-30               26204             64872 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/10-30               100          62445228 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/100-30                     10000            193471 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/100-30              20247            135334 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/100-30              100          64685802 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/1000-30                    14314            188163 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/1000-30             10000            176901 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/1000-30             100         212289431 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/10000-30                    9067            658089 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/10000-30            10000            721403 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/10000-30            100         234613900 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/100000-30                   2685           1577976 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/100000-30            1924           1079974 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/100000-30           100         146595705 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/1000000-30                   312           9117083 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/1000000-30            298          13086220 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/1000000-30           56         211401036 ns/op

PASS
================================================================================

Related issues: N/A


bduffany · 2023-08-14T15:42:38Z

server/remote_cache/digest/digest_amd64_test.go

+func hasherWithServer() hash.Hash {
+	server := sha256simd.NewAvx512Server()
+	return sha256simd.NewAvx512(server)
+}


In real usage, would we reuse this server across requests? I wonder whether the server should be declared as a top-level var instead of creating a new server on every iteration.

https://github.com/minio/sha256-simd/blob/master/README.md#support-for-avx512

> Due to this different way of scheduling, we decided to use an explicit method to instantiate the AVX512 version. Essentially one or more AVX512 processing servers (Avx512Server) have to be created whereby each server can hash over 3 GB/s on a single core. An hash.Hash object (Avx512Digest) is then instantiated using one of these servers and used in the regular fashion:

I think the expectation here is to create one server per core? There are not a lot of examples 🤔
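
To make the two options concrete, here is a rough sketch of "new server per hasher" vs. "one shared top-level server". `fakeServer`, `newFakeServer` and `serversCreated` are hypothetical stand-ins and not part of sha256-simd; they only model a resource that is costly to create. In the real test the corresponding calls would be `sha256simd.NewAvx512Server()` and `sha256simd.NewAvx512(server)` from the diff above.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"hash"
)

// fakeServer is a hypothetical stand-in for an expensive, reusable resource
// such as an AVX512 hashing server. It is NOT the sha256-simd type.
type fakeServer struct{ id int }

// serversCreated counts how many stand-in servers have been constructed.
var serversCreated int

func newFakeServer() *fakeServer {
	serversCreated++
	return &fakeServer{id: serversCreated}
}

// newHasher stands in for a constructor that binds a hasher to a server.
// Here it simply returns the standard library implementation.
func (s *fakeServer) newHasher() hash.Hash { return sha256.New() }

// sharedServer is created once at package initialization and then reused.
var sharedServer = newFakeServer()

// hasherPerCall mirrors the code under review: a new server on every call.
func hasherPerCall() hash.Hash { return newFakeServer().newHasher() }

// hasherShared reuses the package-level server.
func hasherShared() hash.Hash { return sharedServer.newHasher() }

func main() {
	before := serversCreated
	for i := 0; i < 3; i++ {
		hasherShared()
	}
	fmt.Println("servers created by shared variant:", serversCreated-before)

	before = serversCreated
	for i := 0; i < 3; i++ {
		hasherPerCall()
	}
	fmt.Println("servers created by per-call variant:", serversCreated-before)
}
```

With the per-call variant, server setup cost is counted in every benchmark iteration, which would skew the "with server" numbers if setup is expensive.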


sluongng · 2023-08-14T16:26:41Z

==================== Test output for //server/remote_cache/digest:simd_bench_test:
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 3.10GHz
BenchmarkSIMDDigestCompute/without_SIMD/1000000-30                          2180            556758 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/1000000-30                   2282            674404 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/1000000-30                  571           2252739 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/10000000-30                          121           8813178 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/10000000-30                   129           8296944 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/10000000-30                  50          20486731 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/100000000-30                          14          78125744 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/100000000-30                   13          80598781 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/100000000-30                  7         177869732 ns/op

BenchmarkSIMDDigestCompute/without_SIMD/1000000000-30                          1        4463899572 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_no_server/1000000000-30                   1        1478056796 ns/op
BenchmarkSIMDDigestCompute/with_SIMD_with_server/1000000000-30                 1        1513101099 ns/op
================================================================================

Since the doc mentions a speed-up for inputs >1MB, I ran the test against some larger loads.
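
To compare these numbers with the "3 GB/s" figure quoted from the README, ns/op can be converted to throughput. The sketch below assumes the numeric suffix in each benchmark name is the payload size in bytes and that the whole payload is hashed once per op; neither is stated in the output itself. `throughputMBps` is a helper name made up for this example.

```go
package main

import "fmt"

// throughputMBps converts a payload size and a ns/op figure into MB/s
// (1 MB = 1e6 bytes). It assumes the whole payload is hashed once per op.
func throughputMBps(sizeBytes int64, nsPerOp float64) float64 {
	if nsPerOp <= 0 {
		return 0
	}
	bytesPerSec := float64(sizeBytes) / (nsPerOp / 1e9)
	return bytesPerSec / 1e6
}

func main() {
	// Assumption: the benchmark name suffix is a byte count.
	const size = 1_000_000_000

	// ns/op values copied from the 1000000000 rows of the output above.
	rows := []struct {
		name    string
		nsPerOp float64
	}{
		{"without_SIMD", 4463899572},
		{"with_SIMD_no_server", 1478056796},
		{"with_SIMD_with_server", 1513101099},
	}
	for _, r := range rows {
		fmt.Printf("%-22s %8.1f MB/s\n", r.name, throughputMBps(size, r.nsPerOp))
	}
}
```

The 1000000000 rows ran a single iteration each, so the derived throughput should be read as a rough indication only.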

Overall, the constraints of (1) a bigger message size, (2) a 1-1 mapping between servers and CPU cores, and (3) message padding for alignment make it quite unattractive for our use case.

Gonna close this for now.

sluongng · 2023-08-16T11:10:00Z

After digging into this a bit more, it seems that the CPUs we have on GCP, at least for our executors, do not include Intel's SHA extensions:

Name: Intel(R) Xeon(R) CPU @ 3.10GHz
PhysicalCores: 15
ThreadsPerCore: 2
LogicalCores: 30
Family 6 Model: 85 Vendor ID: Intel
Features: ADX,AESNI,AVX,AVX2,AVX512BW,AVX512CD,AVX512DQ,AVX512F,AVX512VL,AVX512VNNI,BMI1,BMI2,CLMUL,CMOV,CMPXCHG8,CX16,ERMS,F16C,FMA3,FXSR,FXSROPT,HLE,HTT,HYPERVISOR,IA32_ARCH_CAP,IBPB,LAHF,LZCNT,MD_CLEAR,MMX,MOVBE,MPX,NX,OSXSAVE,POPCNT,RDRAND,RDSEED,RDTSCP,RTM,SPEC_CTRL_SSBD,SSE,SSE2,SSE3,SSE4,SSE42,SSSE3,STIBP,SYSCALL,SYSEE,VMX,X87,XGETBV1,XSAVE,XSAVEC,XSAVEOPT,XSAVES
Cacheline bytes: 64
L1 Data Cache: 32768 bytes
L1 Instruction Cache: 32768 bytes
L2 Cache: 1048576 bytes
L3 Cache: 25952256 bytes
Frequency 3100000000 hz

And the minio/sha256-simd code has this clause https://github.com/minio/sha256-simd/blob/6096f891a77bfe490cbea7a424c821b5fdb92849/cpuid_other.go#L27

So when we use sha256simd.New(), it is essentially a thin wrapper around crypto/sha256, and thus the results show no difference. If we ever switch to AMD Ryzen / Epyc, we could test this again.
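
A quick way to check a feature dump like the one above for a given extension is sketched below. It assumes the tool that printed the list would report Intel's SHA extensions as an entry named `SHA`; that name is an assumption, and `hasFeature` is a helper made up for this example. The list in `main` is an abbreviated copy of the one above.

```go
package main

import (
	"fmt"
	"strings"
)

// hasFeature reports whether name appears as an exact, case-insensitive
// entry in a comma-separated feature list such as the "Features:" line.
func hasFeature(featureList, name string) bool {
	for _, f := range strings.Split(featureList, ",") {
		if strings.EqualFold(strings.TrimSpace(f), name) {
			return true
		}
	}
	return false
}

func main() {
	// Abbreviated copy of the feature list reported for the executor CPU.
	features := "ADX,AESNI,AVX,AVX2,AVX512BW,AVX512CD,AVX512DQ,AVX512F,AVX512VL,SSE42,SSSE3"

	// Assumption: the SHA extensions would be listed as "SHA".
	fmt.Println("SHA extensions:", hasFeature(features, "SHA"))
	fmt.Println("AVX512F:", hasFeature(features, "AVX512F"))
}
```

Matching whole entries rather than substrings avoids false positives from longer feature names that merely contain the searched text.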

The AVX512 implementation is mostly targeted toward hashing bigger files/messages and thus is not suitable for our use case for now.

The ARM64 implementation could be attractive for ARM64 executors (Linux / macOS) down the line, but my benchmark on an M1 laptop does not show a big speed-up.

Pushed my latest local setup to the branch so future me / other folks could replicate the experiment.

bduffany reviewed Aug 14, 2023


Adjust benchmark for larger message

ae40842

sluongng closed this Aug 14, 2023

sluongng mentioned this pull request Aug 25, 2023

go: upgrade to 1.21 #4516

Merged
