
cds: add a benchmark #14167

Closed
wants to merge 16 commits

Conversation

Contributor

pgenera commented Nov 24, 2020

Commit Message: Add a benchmark test for Cluster Discovery Service.
Additional Description: This copies eds_speed_test and modifies it to exercise CDS instead, with some small changes in benchmark options.
Risk Level: Low - test-only change.
Testing: Test runs & passes in both _benchmark_test & bazel run cds_speed_test variants.
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Fixes #14005

Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
Contributor Author

pgenera commented Nov 24, 2020

clang-tidy failure is:

Value stored to '_' during its initialization is never read

which is, uh, the point?

pgenera marked this pull request as ready for review November 24, 2020 21:16
Member

We've added // NOLINT to similar lines in other benchmark tests. I think it's fine to do the same here.
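(For reference, a minimal generic google/benchmark skeleton showing where such a suppression goes; this is not the Envoy test itself, and the exact NOLINT wording used in Envoy's benchmarks may differ.)

#include <benchmark/benchmark.h>

static void bmExample(benchmark::State& state) {
  // '_' is deliberately never read; the NOLINT silences the clang-tidy
  // dead-store diagnostic quoted above.
  for (auto _ : state) { // NOLINT
    benchmark::DoNotOptimize(state.iterations());
  }
}
BENCHMARK(bmExample);
BENCHMARK_MAIN();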

Member

/assign @jmarantz

Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
Contributor

jmarantz left a comment


This looks great to me though I'm not an expert in the cluster-init stuff so I may have missed something.

Just a few questions which could be resolved by adding comments or small changes.

Envoy::Upstream::CdsSpeedTest speed_test(state, false);
uint32_t clusters = skipExpensiveBenchmarks() ? 1 : state.range(0);

speed_test.clusterHelper(true, clusters);
Contributor

can you comment a bit on a few choices here:

  1. Why call speed_test.clusterHelper(true, clusters); twice?
  2. Why not construct CdsSpeedTest outside the loop?
  3. Related: should we put the Envoy::Logger::Context into the class to avoid repeating that code for each test?

I'm not pushing back against those decisions; I'm just trying to understand them.

Contributor Author

  1. that's the difference between the two test cases; I added a comment (hopefully) explaining.
  2. I assume the loop here is modifying state in a meaningful way, which is then passed to the newly constructed speed_test. This auto _ : state is directly from the example in https://github.com/google/benchmark
  3. y'know, I don't know what the Logger Context was doing here, and removing it has no ill effect...

This was mostly copy & pasted from eds_speed_test (I don't understand github's copy detection and why it didn't notice this); I've backported the changes to eds_speed_test as well.

Contributor

RE 1 and 3: cool, all set, thanks.

Re 2: the auto _ : state pattern is fine, but I think state.range(x) won't change within a call, and it would be safe to construct the CdsSpeedTest object outside the loop. I think that will result in a more focused perf benchmark. WDYT?
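(A hedged sketch of the restructuring proposed here, reusing the names from the snippet quoted above; CdsSpeedTest, clusterHelper and skipExpensiveBenchmarks are this PR's test code, not a public API, and this is one possible shape rather than the PR's final code.)

static void addClusters(benchmark::State& state) {
  // state.range(0) is fixed for a given benchmark invocation, so both the
  // fixture and the cluster count can be set up once, before the timed loop.
  Envoy::Upstream::CdsSpeedTest speed_test(state, false);
  const uint32_t clusters = skipExpensiveBenchmarks() ? 1 : state.range(0);
  for (auto _ : state) { // NOLINT
    speed_test.clusterHelper(true, clusters);
  }
}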

Contributor Author

Testing bears out your hypothesis, I see what I missed there.

Envoy::Upstream::CdsSpeedTest speed_test(state, state.range(0));
// if we've been instructed to skip tests, only run once no matter the argument:
uint32_t clusters = skipExpensiveBenchmarks() ? 1 : state.range(2);
speed_test.clusterHelper(state.range(1), clusters);
Contributor

Wouldn't this also include the time it took to build the response proto in the benchmark?

Contributor Author

I don't believe so; the calls to state_.PauseTiming() and ResumeTiming() in clusterHelper should exclude everything but the grpc callback?

Contributor

I think it would be clearer what we're benchmarking if we put the PauseTiming call at the start of the loop here (line 162) and Resume at the end, and then put Resume/Pause calls in the helper functions surrounding the code we really want to test.
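(A self-contained sketch of the timing structure being suggested; the helper bodies are empty stand-ins, and the real code would do the proto construction and gRPC delivery in their place.)

#include <benchmark/benchmark.h>

static void buildResponseProto() {}   // stand-in for per-iteration setup
static void deliverGrpcCallback() {}  // stand-in for the code under test

// Helper: resume only around the code we really want to measure.
static void clusterHelperSketch(benchmark::State& state) {
  buildResponseProto(); // excluded from the measurement
  state.ResumeTiming();
  deliverGrpcCallback(); // the gRPC callback path is the only timed part
  state.PauseTiming();
}

// Driver: pause at the top of the loop and resume at the bottom, as suggested above.
static void addClustersSketch(benchmark::State& state) {
  for (auto _ : state) { // NOLINT
    state.PauseTiming();
    clusterHelperSketch(state);
    state.ResumeTiming();
  }
}
BENCHMARK(addClustersSketch);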

Contributor Author

I think this is what you mean.

Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
Contributor Author

pgenera commented Dec 2, 2020

I (somewhat unrelatedly) made the floor iteration counts match between cds & eds_speed_test in the last commit; below 100 or so iterations the constant factor dominates.

Contributor

jmarantz left a comment

nice! lgtm; just a couple of nits.

test/common/upstream/cds_speed_test.cc (two outdated review threads, resolved)
Signed-off-by: Phil Genera <pgenera@google.com>
jmarantz previously approved these changes Dec 3, 2020
Contributor

jmarantz left a comment

thanks phil!

Contributor

jmarantz commented Dec 3, 2020

Probably it makes sense for some @envoyproxy/senior-maintainers to take a look -- there may be some subtlety in the interaction with the CDS system.

Phil -- did you get some interesting results with the test?

Contributor Author

pgenera commented Dec 3, 2020

No smoking gun, unfortunately, just things we know: v3 is faster than v2, ignoring unknown fields is faster than validating them. Unlike EDS, there's no obvious n^2 or worse behavior.

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
addClusters/0/0/64          0.181 ms        0.182 ms         3855
addClusters/1/0/64          0.691 ms        0.691 ms         1010
addClusters/0/1/64          0.092 ms        0.092 ms         7674
addClusters/1/1/64          0.218 ms        0.219 ms         3146
addClusters/0/0/512          1.36 ms         1.36 ms          514
addClusters/1/0/512          5.55 ms         5.55 ms          125
addClusters/0/1/512         0.626 ms        0.626 ms         1114
addClusters/1/1/512          1.78 ms         1.78 ms          394
addClusters/0/0/4096         11.7 ms         11.7 ms           61
addClusters/1/0/4096         45.9 ms         45.9 ms           15
addClusters/0/1/4096         5.66 ms         5.66 ms          125
addClusters/1/1/4096         15.5 ms         15.5 ms           46
addClusters/0/0/32768        99.6 ms         99.6 ms            6
addClusters/1/0/32768         368 ms          368 ms            2
addClusters/0/1/32768        52.6 ms         52.6 ms           13
addClusters/1/1/32768         127 ms          127 ms            5
addClusters/0/0/100000        300 ms          300 ms            2
addClusters/1/0/100000       1127 ms         1127 ms            1
addClusters/0/1/100000        159 ms          159 ms            4
addClusters/1/1/100000        389 ms          389 ms            2
addClusters_BigO          4934.77 N       4934.10 N    
addClusters_RMS               130 %           130 %    
duplicateUpdate/64          0.187 ms        0.187 ms         3832
duplicateUpdate/512          1.31 ms         1.31 ms          534
duplicateUpdate/4096         11.6 ms         11.6 ms           62
duplicateUpdate/32768         107 ms          107 ms            7
duplicateUpdate/100000        316 ms          316 ms            2
duplicateUpdate_BigO      3172.78 N       3172.41 N    
duplicateUpdate_RMS             2 %             2 %    
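(A hedged guess at how the three benchmark arguments in the rows above might be registered with google/benchmark; the actual registration in cds_speed_test.cc may differ.)

// Hypothetical registration producing rows like addClusters/1/0/4096: the first
// two arguments are the boolean knobs discussed above (API version and
// validate-vs-ignore unknown fields, in whichever order the test defines them),
// and the third is the cluster count.
BENCHMARK(addClusters)
    ->RangeMultiplier(8)
    ->Ranges({{false, true}, {false, true}, {64, 100'000}})
    ->Unit(benchmark::kMillisecond)
    ->Complexity(); // the _BigO/_RMS rows imply Complexity() plus a SetComplexityN() call in the test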

Member

htuch left a comment

Thanks for putting this together! Do you have any perf profiles and flamegraphs when running some of the longer benchmarks?
/wait

test/common/upstream/cds_speed_test.cc (two outdated review threads, resolved)
resetCluster(R"EOF(
name: name
connect_timeout: 0.25s
type: EDS
Member

In terms of Cluster configuration, I think there are ultimately many dimensions to consider, but the ones I'd start with are the number of clusters, the number of endpoints per cluster, and the choice of load balancing algorithm (everything from WRR to Maglev, etc.). This will provide a pretty interesting microbenchmark.
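(A hedged sketch of how the extra dimensions named above could be expressed directly on the v3 Cluster proto; the helper name and its wiring into the benchmark are hypothetical.)

#include "envoy/config/cluster/v3/cluster.pb.h"

void applyBenchmarkDimensions(envoy::config::cluster::v3::Cluster& cluster,
                              envoy::config::cluster::v3::Cluster::LbPolicy lb_policy,
                              int endpoints_per_cluster) {
  cluster.set_lb_policy(lb_policy); // e.g. ROUND_ROBIN, RING_HASH, MAGLEV
  auto* locality = cluster.mutable_load_assignment()->add_endpoints();
  for (int i = 0; i < endpoints_per_cluster; ++i) {
    auto* socket = locality->add_lb_endpoints()
                       ->mutable_endpoint()
                       ->mutable_address()
                       ->mutable_socket_address();
    socket->set_address("127.0.0.1");
    socket->set_port_value(10000 + i);
  }
}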

Contributor Author

pgenera Dec 7, 2020

Right now this exercises the number of clusters for both the v2 & v3 APIs; I propose taking out the v2 part and adding LB algorithm & endpoints per cluster, all for STATIC clusters. Does that sound like a good next revision?

In the interim I've separated out the v2 & v3 tests to make the output more readable and give the Complexity Computing Magic a better chance at working.

I haven't generated interesting output, perf or otherwise, as I just got this working again. I'm happy to do so once we know I'm testing the right things.

Member

Yes, let's drop v2 as it's on the way out. Agree on the changes proposed.

Member

FWIW, I think we could also explore LB algorithm and endpoints in the EDS test if it were able to fully simulate what is going on in eds_speed_test.cc, so maybe we can make the CDS test even simpler and exercise some other cluster attributes in the parameter space. For now, just dropping v2 sounds good.

Contributor Author

I've dropped v2; maybe adding a larger set of test parameters to this or eds_speed_test should go in a followup?

Member

Let's do the additions in followup PRs.

Signed-off-by: Phil Genera <pgenera@google.com>
Contributor Author

pgenera commented Dec 7, 2020

Here's the current output, with warnings about v2 API deprecation elided:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
addV3Clusters/0/64          0.780 ms        0.778 ms          916
addV3Clusters/1/64          0.228 ms        0.226 ms         3083
addV3Clusters/0/512          5.99 ms         5.99 ms          119
addV3Clusters/1/512          1.85 ms         1.85 ms          381
addV3Clusters/0/4096         47.0 ms         47.0 ms           15
addV3Clusters/1/4096         14.0 ms         14.0 ms           51
addV3Clusters/0/32768         375 ms          374 ms            2
addV3Clusters/1/32768         103 ms          103 ms            7
addV3Clusters/0/100000       1161 ms         1161 ms            1
addV3Clusters/1/100000        314 ms          314 ms            2
addV3Clusters_BigO        7369.25 N       7368.19 N    
addV3Clusters_RMS              98 %            98 %    

addV2Clusters/0/64           3.62 ms         3.62 ms          192
addV2Clusters/1/64          0.825 ms        0.824 ms          858
addV2Clusters/0/512          27.6 ms         27.5 ms           25
addV2Clusters/1/512          6.56 ms         6.56 ms          108
addV2Clusters/0/4096          219 ms          219 ms            3
addV2Clusters/1/4096         52.0 ms         52.0 ms           14
addV2Clusters/0/32768        1797 ms         1797 ms            1
addV2Clusters/1/32768         405 ms          405 ms            2
addV2Clusters/1/100000       1227 ms         1227 ms            1
addV2Clusters_BigO       32975.49 N      32970.82 N    
addV2Clusters_RMS             107 %           107 %    

duplicateUpdate/64          0.544 ms        0.542 ms         1346
duplicateUpdate/512          4.13 ms         4.12 ms          167
duplicateUpdate/4096         33.7 ms         33.7 ms           21
duplicateUpdate/32768         244 ms          244 ms            3
duplicateUpdate/100000        775 ms          775 ms            1
duplicateUpdate_BigO      7724.84 N       7723.78 N    
duplicateUpdate_RMS             2 %             2 %    

Member

htuch left a comment

Looks good, waiting for v2 removal.

test/common/upstream/eds_speed_test.cc (review thread, resolved)
htuch added the waiting label Dec 8, 2020
Signed-off-by: Phil Genera <pgenera@google.com>
htuch previously approved these changes Dec 10, 2020
Member

htuch left a comment

LGTM, thanks. I think the CDS/EDS tests could ultimately be refactored into a framework for more general xDS benchmarks, but we can live by the Rule of Three in this PR.

Member

htuch commented Dec 10, 2020

@pgenera the ASAN failure looks legit.

Contributor Author

pgenera commented Dec 10, 2020

I've managed to get a flame graph out (an SVG can't be attached here, so have a gist), via the following:

perf record -g ~/git/envoy/bazel-bin/test/common/upstream/cds_speed_test -- --benchmark_filter="addClusters/1/100000" --benchmark_repetitions=10
perf script | ~/FlameGraph/stackcollapse-perf.pl  > out.perf
cat out.perf | ~/FlameGraph/flamegraph.pl > cds.svg

Unfortunately it's dominated by making protos for the test, rather than the portion that we're actively measuring.

I'll take a look at the asan failure shortly.

Member

htuch commented Dec 10, 2020

Thanks @pgenera. Almost all the time is in buildStaticCluster, which is not in the measured benchmark portion but still exists in the profile. I wonder if we can build all clusters once and then do a ton of updates to avoid having this be excessive. This will allow for actionable use of the benchmark to dive into the CDS overheads via profiles.

Contributor Author

pgenera commented Dec 10, 2020

I was going to explore filtering the perf output, but yeah, that's annoying (it also explains why this test takes so long to run despite the timings being quite small).

Separately I wonder if these cluster creation helpers should create protos directly instead of parsing YAML.

Contributor

Yeah, the flame graph you shared is dominated by the protobuf parsing library, which is not consistent with what we see in captures from live servers.

Probably the actual proto marshalling/unmarshalling is much faster than YAML parsing.

Member

htuch commented Dec 10, 2020

Separately I wonder if these cluster creation helpers should create protos directly instead of parsing YAML.

This would be optimal. I'd just port the buildStaticCluster logic to your benchmark; it's not doing that much. That utility is intended for integration testing (and clarity via YAML), not for performance. I'd probably recommend against using it anyway, since performance and testing have divergent requirements.
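(A minimal sketch of what porting that logic into the benchmark might look like: build the v3 Cluster proto directly rather than round-tripping through YAML. The connect_timeout value mirrors the YAML quoted earlier, STATIC follows the plan discussed above, and the function name is illustrative, not Envoy's actual utility.)

#include <cstdint>
#include <string>

#include "envoy/config/cluster/v3/cluster.pb.h"

envoy::config::cluster::v3::Cluster makeStaticCluster(const std::string& name,
                                                      const std::string& address,
                                                      uint32_t port) {
  envoy::config::cluster::v3::Cluster cluster;
  cluster.set_name(name);
  cluster.set_type(envoy::config::cluster::v3::Cluster::STATIC);
  cluster.mutable_connect_timeout()->set_nanos(250000000); // 0.25s, as in the YAML
  auto* assignment = cluster.mutable_load_assignment();
  assignment->set_cluster_name(name);
  auto* socket = assignment->add_endpoints()
                     ->add_lb_endpoints()
                     ->mutable_endpoint()
                     ->mutable_address()
                     ->mutable_socket_address();
  socket->set_address(address);
  socket->set_port_value(port);
  return cluster;
}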

Contributor

That sounds good. Would it be sufficient, though, to just turn off the profile collection during the YAML parsing, assuming that's possible?

Contributor Author

pgenera commented Dec 11, 2020

I ported buildStaticCluster into the benchmark and got a more interesting flame graph out; I haven't dug deep or done any filtering yet, however.

I agree that we're on the cusp of wanting to build something generic for xDS benchmarking.

Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
Contributor Author

pgenera commented Dec 11, 2020

Here's the most recent flame graph, now using a more concrete stats implementation; however, I still don't see the stats overhead I'd expect.

I don't have any leads on the asan failure, other than having reproduced it:

$ bazel build -c dbg --config=clang-asan //test/common/upstream:eds_speed_test_benchmark_test
...
ERROR: /usr/local/google/home/pgenera/git/envoy/test/common/upstream/BUILD:114:26: Linking of rule '//test/common/upstream:eds_speed_test' failed (Exit 1) clang failed: error executing command /usr/local/google/home/pgenera/.clang9/bin/clang @bazel-out/k8-dbg/bin/test/common/upstream/eds_speed_test-2.params

Use --sandbox_debug to see verbose messages from the sandbox
ld.lld: error: undefined symbol: __muloti4
>>> referenced by int128_have_intrinsic.inc:251 (external/com_google_absl/absl/numeric/int128_have_intrinsic.inc:251)
>>>               bazel-out/k8-dbg/bin/external/com_google_absl/absl/strings/_objs/strings/numbers.pic.o:(absl::operator*(absl::int128, absl::int128))
clang-9: error: linker command failed with exit code 1 (use -v to see invocation)
Target //test/common/upstream:eds_speed_test_benchmark_test failed to build

Contributor Author

pgenera commented Dec 11, 2020

Having now found #13973, it builds with --config=asan and -fsanitize=undefined removed from the project bazelrc. Obviously that's not the right solution, but it's at least informative.

Contributor

Quick sanity check: did you use "-c opt" for the build you used to collect the flame graph?

Contributor Author

pgenera commented Dec 12, 2020 via email

Member

htuch commented Dec 12, 2020

I don't think it's an issue of stats or not; we're not spending any significant time in

if (cm_.addOrUpdateCluster(cluster, resource.get().version())) {

This is what dominates in the workload flamegraphs that we want to model. I'm suspecting some artifact of the benchmark, e.g. mocks or maybe the actual config proto, is to blame. I suggest doing a single-iteration run with -l trace logging and trying to figure out what is going on in terms of code execution, i.e. whether a single cluster is being built as expected.

Contributor

jmarantz commented Dec 13, 2020 via email

Contributor Author

pgenera commented Dec 14, 2020

Sounds good, I'll start tearing the mocks out.

Contributor

/wait

pgenera marked this pull request as draft January 14, 2021 16:42
yanavlasov self-assigned this Jan 15, 2021
Base automatically changed from master to main January 15, 2021 23:01
yanavlasov added the no stalebot label Feb 5, 2021
jmarantz removed their assignment Oct 18, 2021
alyssawilk closed this Jan 4, 2023