
cds: add a benchmark #14167

Closed
wants to merge 16 commits

Conversation

Contributor

pgenera commented Nov 24, 2020

Commit Message: Add a benchmark test for Cluster Discovery Service.
Additional Description: This copies eds_speed_test and modifies it to exercise CDS instead, with some small changes in benchmark options.
Risk Level: Low - test-only change.
Testing: Test runs & passes in both _benchmark_test & bazel run cds_speed_test variants.
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Fixes #14005

Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
Contributor Author

pgenera commented Nov 24, 2020

clang-tidy failure is:

Value stored to '_' during its initialization is never read

which is, uh, the point?

pgenera marked this pull request as ready for review November 24, 2020 21:16
Member

We've added // NOLINT to similar lines in other benchmark tests. I think it's fine to do the same here.
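(For reference, a minimal generic google/benchmark skeleton showing where such a suppression goes; this is not the Envoy test itself, and the exact NOLINT wording used in Envoy's benchmarks may differ.)

#include <benchmark/benchmark.h>

static void bmExample(benchmark::State& state) {
  // '_' is deliberately never read; the NOLINT silences the clang-tidy
  // dead-store diagnostic quoted above.
  for (auto _ : state) { // NOLINT
    benchmark::DoNotOptimize(state.iterations());
  }
}
BENCHMARK(bmExample);
BENCHMARK_MAIN();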

Member

/assign @jmarantz

Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
Contributor

jmarantz left a comment


This looks great to me though I'm not an expert in the cluster-init stuff so I may have missed something.

Just a few questions which could be resolved by adding comments or small changes.

Envoy::Upstream::CdsSpeedTest speed_test(state, false);
uint32_t clusters = skipExpensiveBenchmarks() ? 1 : state.range(0);

speed_test.clusterHelper(true, clusters);
Contributor

can you comment a bit on a few choices here:

  1. Why call speed_test.clusterHelper(true, clusters); twice?
  2. Why not construct CdsSpeedTest outside the loop?
  3. Related: should we put the Envoy::Logger::Context into the class to avoid repeating that code for each test?

I'm not pushing back against those decisions; I'm just trying to understand them.

Contributor Author

  1. that's the difference between the two test cases; I added a comment (hopefully) explaining.
  2. I assume the loop here is modifying state in a meaningful way, which is then passed to the newly constructed speed_test. This auto _ : state is directly from the example in https://github.com/google/benchmark
  3. y'know, I don't know what the Logger Context was doing here, and removing it has no ill effect...

This was mostly copy & pasted from eds_speed_test (I don't understand github's copy detection and why it didn't notice this); I've backported the changes to eds_speed_test as well.

Contributor

RE 1 and 3: cool, all set, thanks.

Re 2: the auto _ : state pattern is fine, but I think state.range(x) won't change within a call, and it would be safe to construct the CdsSpeedTest object outside the loop. I think that will result in a more focused perf benchmark. WDYT?
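(A hedged sketch of the restructuring proposed here, reusing the names from the snippet quoted above; CdsSpeedTest, clusterHelper and skipExpensiveBenchmarks are this PR's test code, not a public API, and this is one possible shape rather than the PR's final code.)

static void addClusters(benchmark::State& state) {
  // state.range(0) is fixed for a given benchmark invocation, so both the
  // fixture and the cluster count can be set up once, before the timed loop.
  Envoy::Upstream::CdsSpeedTest speed_test(state, false);
  const uint32_t clusters = skipExpensiveBenchmarks() ? 1 : state.range(0);
  for (auto _ : state) { // NOLINT
    speed_test.clusterHelper(true, clusters);
  }
}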

Contributor Author

Testing bears out your hypothesis, I see what I missed there.

Envoy::Upstream::CdsSpeedTest speed_test(state, state.range(0));
// if we've been instructed to skip tests, only run once no matter the argument:
uint32_t clusters = skipExpensiveBenchmarks() ? 1 : state.range(2);
speed_test.clusterHelper(state.range(1), clusters);
Contributor

Wouldn't this also include the time it took to build the response proto in the benchmark?

Contributor Author

I don't believe so; the calls to state_.PauseTiming() and ResumeTiming() in clusterHelper should exclude everything but the grpc callback?

Contributor

I think it would be clearer what we're benchmarking if we put the PauseTiming call at the start of the loop here (line 162) and Resume at the end, and then put Resume/Pause calls in the helper functions surrounding the code we really want to test.
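(A self-contained sketch of the timing structure being suggested; the helper bodies are empty stand-ins, and the real code would do the proto construction and gRPC delivery in their place.)

#include <benchmark/benchmark.h>

static void buildResponseProto() {}   // stand-in for per-iteration setup
static void deliverGrpcCallback() {}  // stand-in for the code under test

// Helper: resume only around the code we really want to measure.
static void clusterHelperSketch(benchmark::State& state) {
  buildResponseProto(); // excluded from the measurement
  state.ResumeTiming();
  deliverGrpcCallback(); // the gRPC callback path is the only timed part
  state.PauseTiming();
}

// Driver: pause at the top of the loop and resume at the bottom, as suggested above.
static void addClustersSketch(benchmark::State& state) {
  for (auto _ : state) { // NOLINT
    state.PauseTiming();
    clusterHelperSketch(state);
    state.ResumeTiming();
  }
}
BENCHMARK(addClustersSketch);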

Contributor Author

I think this is what you mean.

Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
Contributor Author

pgenera commented Dec 2, 2020

I (somewhat unrelatedly) made the floor iteration counts match between cds & eds_speed_test in the last commit; below 100 or so iterations the constant factor dominates.

Contributor

jmarantz left a comment

nice! lgtm; just a couple of nits.

test/common/upstream/cds_speed_test.cc (two outdated review threads, resolved)
Signed-off-by: Phil Genera <pgenera@google.com>
jmarantz previously approved these changes Dec 3, 2020
Contributor

jmarantz left a comment

thanks phil!

Contributor

jmarantz commented Dec 3, 2020

Probably it makes sense for some @envoyproxy/senior-maintainers to take a look -- there may be some subtlety in the interaction with the CDS system.

Phil -- did you get some interesting results with the test?

Contributor Author

pgenera commented Dec 3, 2020

No smoking gun, unfortunately, just things we know: v3 is faster than v2, ignoring unknown fields is faster than validating them. Unlike EDS, there's no obvious n^2 or worse behavior.

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
addClusters/0/0/64          0.181 ms        0.182 ms         3855
addClusters/1/0/64          0.691 ms        0.691 ms         1010
addClusters/0/1/64          0.092 ms        0.092 ms         7674
addClusters/1/1/64          0.218 ms        0.219 ms         3146
addClusters/0/0/512          1.36 ms         1.36 ms          514
addClusters/1/0/512          5.55 ms         5.55 ms          125
addClusters/0/1/512         0.626 ms        0.626 ms         1114
addClusters/1/1/512          1.78 ms         1.78 ms          394
addClusters/0/0/4096         11.7 ms         11.7 ms           61
addClusters/1/0/4096         45.9 ms         45.9 ms           15
addClusters/0/1/4096         5.66 ms         5.66 ms          125
addClusters/1/1/4096         15.5 ms         15.5 ms           46
addClusters/0/0/32768        99.6 ms         99.6 ms            6
addClusters/1/0/32768         368 ms          368 ms            2
addClusters/0/1/32768        52.6 ms         52.6 ms           13
addClusters/1/1/32768         127 ms          127 ms            5
addClusters/0/0/100000        300 ms          300 ms            2
addClusters/1/0/100000       1127 ms         1127 ms            1
addClusters/0/1/100000        159 ms          159 ms            4
addClusters/1/1/100000        389 ms          389 ms            2
addClusters_BigO          4934.77 N       4934.10 N    
addClusters_RMS               130 %           130 %    
duplicateUpdate/64          0.187 ms        0.187 ms         3832
duplicateUpdate/512          1.31 ms         1.31 ms          534
duplicateUpdate/4096         11.6 ms         11.6 ms           62
duplicateUpdate/32768         107 ms          107 ms            7
duplicateUpdate/100000        316 ms          316 ms            2
duplicateUpdate_BigO      3172.78 N       3172.41 N    
duplicateUpdate_RMS             2 %             2 %    
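(A hedged guess at how the three benchmark arguments in the rows above might be registered with google/benchmark; the actual registration in cds_speed_test.cc may differ.)

// Hypothetical registration producing rows like addClusters/1/0/4096: the first
// two arguments are the boolean knobs discussed above (API version and
// validate-vs-ignore unknown fields, in whichever order the test defines them),
// and the third is the cluster count.
BENCHMARK(addClusters)
    ->RangeMultiplier(8)
    ->Ranges({{false, true}, {false, true}, {64, 100'000}})
    ->Unit(benchmark::kMillisecond)
    ->Complexity(); // the _BigO/_RMS rows imply Complexity() plus a SetComplexityN() call in the test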

Member

htuch left a comment

Thanks for putting this together! Do you have any perf profiles and flamegraphs when running some of the longer benchmarks?
/wait

test/common/upstream/cds_speed_test.cc (two outdated review threads, resolved)
resetCluster(R"EOF(
name: name
connect_timeout: 0.25s
type: EDS
Member

In terms of Cluster configuration, I think there are ultimately many dimensions to consider, but the ones I'd start with are the number of clusters, the number of endpoints per cluster, and the choice of load balancing algorithm (everything from WRR to Maglev, etc.). This will provide a pretty interesting microbenchmark.
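(A hedged sketch of how the extra dimensions named above could be expressed directly on the v3 Cluster proto; the helper name and its wiring into the benchmark are hypothetical.)

#include "envoy/config/cluster/v3/cluster.pb.h"

void applyBenchmarkDimensions(envoy::config::cluster::v3::Cluster& cluster,
                              envoy::config::cluster::v3::Cluster::LbPolicy lb_policy,
                              int endpoints_per_cluster) {
  cluster.set_lb_policy(lb_policy); // e.g. ROUND_ROBIN, RING_HASH, MAGLEV
  auto* locality = cluster.mutable_load_assignment()->add_endpoints();
  for (int i = 0; i < endpoints_per_cluster; ++i) {
    auto* socket = locality->add_lb_endpoints()
                       ->mutable_endpoint()
                       ->mutable_address()
                       ->mutable_socket_address();
    socket->set_address("127.0.0.1");
    socket->set_port_value(10000 + i);
  }
}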

Contributor Author

pgenera Dec 7, 2020

Right now this exercises the number of clusters for both the v2 & v3 APIs; I propose taking out the v2 part and adding LB algorithm & endpoints per cluster, all for STATIC clusters. Does that sound like a good next revision?

In the interim I've separated out the v2 & v3 tests to make the output more readable and give the Complexity Computing Magic a better chance at working.

I haven't generated interesting output, perf or otherwise, as I just got this working again. I'm happy to do so once we know I'm testing the right things.

Member

Yes, let's drop v2 as it's on the way out. Agree on the changes proposed.

Member

FWIW, I think we could also explore LB algorithm and endpoints in the EDS test if it were able to fully simulate what is going on in eds_speed_test.cc, so maybe we can make the CDS test even simpler and exercise some other cluster attributes in the parameter space. For now, just dropping v2 sounds good.

Contributor Author

I've dropped v2; maybe adding a larger set of test parameters to this or eds_speed_test should go in a followup?

Member

Let's do the additions in followup PRs.

Signed-off-by: Phil Genera <pgenera@google.com>
Contributor Author

pgenera commented Dec 7, 2020

Here's the current output, with warnings about v2 API deprecation elided:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
addV3Clusters/0/64          0.780 ms        0.778 ms          916
addV3Clusters/1/64          0.228 ms        0.226 ms         3083
addV3Clusters/0/512          5.99 ms         5.99 ms          119
addV3Clusters/1/512          1.85 ms         1.85 ms          381
addV3Clusters/0/4096         47.0 ms         47.0 ms           15
addV3Clusters/1/4096         14.0 ms         14.0 ms           51
addV3Clusters/0/32768         375 ms          374 ms            2
addV3Clusters/1/32768         103 ms          103 ms            7
addV3Clusters/0/100000       1161 ms         1161 ms            1
addV3Clusters/1/100000        314 ms          314 ms            2
addV3Clusters_BigO        7369.25 N       7368.19 N    
addV3Clusters_RMS              98 %            98 %    

addV2Clusters/0/64           3.62 ms         3.62 ms          192
addV2Clusters/1/64          0.825 ms        0.824 ms          858
addV2Clusters/0/512          27.6 ms         27.5 ms           25
addV2Clusters/1/512          6.56 ms         6.56 ms          108
addV2Clusters/0/4096          219 ms          219 ms            3
addV2Clusters/1/4096         52.0 ms         52.0 ms           14
addV2Clusters/0/32768        1797 ms         1797 ms            1
addV2Clusters/1/32768         405 ms          405 ms            2
addV2Clusters/1/100000       1227 ms         1227 ms            1
addV2Clusters_BigO       32975.49 N      32970.82 N    
addV2Clusters_RMS             107 %           107 %    

duplicateUpdate/64          0.544 ms        0.542 ms         1346
duplicateUpdate/512          4.13 ms         4.12 ms          167
duplicateUpdate/4096         33.7 ms         33.7 ms           21
duplicateUpdate/32768         244 ms          244 ms            3
duplicateUpdate/100000        775 ms          775 ms            1
duplicateUpdate_BigO      7724.84 N       7723.78 N    
duplicateUpdate_RMS             2 %             2 %    

Member

htuch left a comment

Looks good, waiting for v2 removal.

test/common/upstream/eds_speed_test.cc (review thread, resolved)
htuch added the waiting label Dec 8, 2020
Signed-off-by: Phil Genera <pgenera@google.com>
htuch previously approved these changes Dec 10, 2020
Member

htuch left a comment

LGTM, thanks. I think the CDS/EDS tests could ultimately be refactored into a framework for more general xDS benchmarks, but we can live by the Rule of Three in this PR.

Member

htuch commented Dec 10, 2020

@pgenera the ASAN failure looks legit.

Contributor Author

pgenera commented Dec 10, 2020

I've managed to get a flame graph out (an SVG can't be attached here, so have a gist), via the following:

perf record -g ~/git/envoy/bazel-bin/test/common/upstream/cds_speed_test -- --benchmark_filter="addClusters/1/100000" --benchmark_repetitions=10
perf script | ~/FlameGraph/stackcollapse-perf.pl  > out.perf
cat out.perf | ~/FlameGraph/flamegraph.pl > cds.svg

Unfortunately it's dominated by making protos for the test, rather than the portion that we're actively measuring.

I'll take a look at the asan failure shortly.

Member

htuch commented Dec 10, 2020

Thanks @pgenera. Almost all the time is in buildStaticCluster, which is not in the measured benchmark portion but still exists in the profile. I wonder if we can build all clusters once and then do a ton of updates to avoid having this be excessive. This will allow for actionable use of the benchmark to dive into the CDS overheads via profiles.

Contributor Author

pgenera commented Dec 10, 2020

I was going to explore filtering the perf output, but yeah, that's annoying (it also explains why this test takes so long to run despite the timings being quite small).

Separately I wonder if these cluster creation helpers should create protos directly instead of parsing YAML.

Contributor

Yeah, the flame graph you shared is dominated by the protobuf parsing library, which is not consistent with what we see in captures from live servers.

Probably the actual proto marshalling/unmarshalling is much faster than YAML parsing.

Member

htuch commented Dec 10, 2020

Separately I wonder if these cluster creation helpers should create protos directly instead of parsing YAML.

This would be optimal. I'd just port the buildStaticCluster logic to your benchmark; it's not doing that much. That utility is intended for integration testing (and clarity via YAML), not for performance. I'd probably recommend against using it anyway, since performance and testing have divergent requirements.
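(A minimal sketch of what porting that logic into the benchmark might look like: build the v3 Cluster proto directly rather than round-tripping through YAML. The connect_timeout value mirrors the YAML quoted earlier, STATIC follows the plan discussed above, and the function name is illustrative, not Envoy's actual utility.)

#include <cstdint>
#include <string>

#include "envoy/config/cluster/v3/cluster.pb.h"

envoy::config::cluster::v3::Cluster makeStaticCluster(const std::string& name,
                                                      const std::string& address,
                                                      uint32_t port) {
  envoy::config::cluster::v3::Cluster cluster;
  cluster.set_name(name);
  cluster.set_type(envoy::config::cluster::v3::Cluster::STATIC);
  cluster.mutable_connect_timeout()->set_nanos(250000000); // 0.25s, as in the YAML
  auto* assignment = cluster.mutable_load_assignment();
  assignment->set_cluster_name(name);
  auto* socket = assignment->add_endpoints()
                     ->add_lb_endpoints()
                     ->mutable_endpoint()
                     ->mutable_address()
                     ->mutable_socket_address();
  socket->set_address(address);
  socket->set_port_value(port);
  return cluster;
}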

Contributor

That sounds good. Would it be sufficient, though, to just turn off the profile collection during the YAML parsing, assuming that's possible?

Contributor Author

pgenera commented Dec 11, 2020

I ported buildStaticCluster into the benchmark and got a more interesting flame graph out; I haven't dug deep or done any filtering yet, however.

I agree that we're on the cusp of wanting to build something generic for xDS benchmarking.

Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
Contributor Author

pgenera commented Dec 11, 2020

Here's the most recent flame graph, now using a more concrete stats implementation; however, I still don't see the stats overhead I'd expect.

I don't have any leads on the asan failure, other than having reproduced it:

$ bazel build -c dbg --config=clang-asan //test/common/upstream:eds_speed_test_benchmark_test
...
ERROR: /usr/local/google/home/pgenera/git/envoy/test/common/upstream/BUILD:114:26: Linking of rule '//test/common/upstream:eds_speed_test' failed (Exit 1) clang failed: error executing command /usr/local/google/home/pgenera/.clang9/bin/clang @bazel-out/k8-dbg/bin/test/common/upstream/eds_speed_test-2.params

Use --sandbox_debug to see verbose messages from the sandbox
ld.lld: error: undefined symbol: __muloti4
>>> referenced by int128_have_intrinsic.inc:251 (external/com_google_absl/absl/numeric/int128_have_intrinsic.inc:251)
>>>               bazel-out/k8-dbg/bin/external/com_google_absl/absl/strings/_objs/strings/numbers.pic.o:(absl::operator*(absl::int128, absl::int128))
clang-9: error: linker command failed with exit code 1 (use -v to see invocation)
Target //test/common/upstream:eds_speed_test_benchmark_test failed to build

Contributor Author

pgenera commented Dec 11, 2020

Having now found #13973, it builds with --config=asan and -fsanitize=undefined removed from the project bazelrc. Obviously that's not the right solution, but it's at least informative.

Contributor

Quick sanity check: did you use "-c opt" for the build you used to collect the flame graph?

Contributor Author

pgenera commented Dec 12, 2020 via email

Member

htuch commented Dec 12, 2020

I don't think it's an issue of stats or not; we're not spending any significant time in

if (cm_.addOrUpdateCluster(cluster, resource.get().version())) {

This is what dominates in the workload flamegraphs that we want to model. I'm suspecting some artifact of the benchmark, e.g. mocks or maybe the actual config proto, is to blame. I suggest doing a single-iteration run with -l trace logging and trying to figure out what is going on in terms of code execution, i.e. whether a single cluster is being built as expected.

Contributor

jmarantz commented Dec 13, 2020 via email

Contributor Author

pgenera commented Dec 14, 2020

Sounds good, I'll start tearing the mocks out.

Contributor

/wait

pgenera marked this pull request as draft January 14, 2021 16:42
yanavlasov self-assigned this Jan 15, 2021
Base automatically changed from master to main January 15, 2021 23:01
yanavlasov added the no stalebot label Feb 5, 2021
jmarantz removed their assignment Oct 18, 2021
alyssawilk closed this Jan 4, 2023