
perf: RPC Serialization #7519

Merged
merged 8 commits into vitessio:master on Feb 24, 2021

Conversation

vmg
Collaborator

@vmg vmg commented Feb 19, 2021

Description

Hellooooo computer fiends. Coming up next in my series of performance improvements, we're tracking RPC performance, starting with the lowest hanging fruit: serialization. Right now all the RPC communication for Vitess happens through GRPC (between vtgate and the tablets, for the admin API, etc), and GRPC kind of forces us towards using Protocol Buffers for (de)serialization. This is actually not a hard requirement on the GRPC API, as it supports custom serializers, but this is not a particularly well tested interface, so in the spirit of incremental, non-breaking changes, let us not go down that road (at least for the time being).
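(For reference, this is roughly what gRPC-Go's pluggable serializer hook looks like -- a minimal sketch using the real google.golang.org/grpc/encoding package with a hypothetical pass-through codec, shown only to illustrate why swapping the serializer is possible but not a road we want to take right now.)

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/encoding"
	// The default proto codec is registered by this package's init().
	_ "google.golang.org/grpc/encoding/proto"
)

// passthroughCodec is a hypothetical codec that simply delegates to the
// registered proto codec; a real custom serializer would implement
// Marshal/Unmarshal itself.
type passthroughCodec struct {
	inner encoding.Codec
}

func (c passthroughCodec) Marshal(v interface{}) ([]byte, error) { return c.inner.Marshal(v) }

func (c passthroughCodec) Unmarshal(data []byte, v interface{}) error {
	return c.inner.Unmarshal(data, v)
}

// Name must match the content-subtype negotiated by client and server;
// reusing "proto" keeps the change transparent on the wire.
func (c passthroughCodec) Name() string { return "proto" }

func main() {
	base := encoding.GetCodec("proto")
	encoding.RegisterCodec(passthroughCodec{inner: base})
	fmt.Println("registered codec:", encoding.GetCodec("proto").Name())
}
```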

So, since we're stuck with Protobuf, what options do we have to improve its performance? The most obvious answer is gogo/protobuf. For those who've never heard of the Gogo generator, it's an alternative implementation of the protoc-gen-go plugin that ships with Google's golang/protobuf package. The new generator supports many new features, the most relevant (for us) being the full unrolling of the generated serialization and deserialization code for all our Protobuf messages. The default implementation of protoc-gen-go generates code that uses reflection at runtime to (de)serialize messages, which adds a very significant performance overhead.
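To make the reflection point concrete, here is a simplified sketch (not the exact gRPC or protobuf-library internals) of how a serializer can prefer the unrolled methods that the gogo generators emit on every message, falling back to the reflection-based proto.Marshal otherwise:

```go
package main

import (
	"fmt"

	"github.com/golang/protobuf/proto"
)

// fastMarshaler is the shape of the unrolled, reflection-free Marshal method
// that the gogo generators emit on every message type.
type fastMarshaler interface {
	Marshal() ([]byte, error)
}

// marshal prefers the generated fast path when a message provides one, and
// falls back to the reflection-based proto.Marshal otherwise.
func marshal(msg proto.Message) ([]byte, error) {
	if fm, ok := msg.(fastMarshaler); ok {
		return fm.Marshal() // unrolled code, generated per message type
	}
	return proto.Marshal(msg) // walks the message via reflection at runtime
}

// fakeMsg stands in for a generated message so the example is self-contained.
type fakeMsg struct{ payload []byte }

func (m *fakeMsg) Reset()                   {}
func (m *fakeMsg) String() string           { return "fakeMsg" }
func (m *fakeMsg) ProtoMessage()            {}
func (m *fakeMsg) Marshal() ([]byte, error) { return m.payload, nil }

func main() {
	out, _ := marshal(&fakeMsg{payload: []byte{0x0a, 0x03, 'f', 'o', 'o'}})
	fmt.Printf("%d bytes via the fast path\n", len(out))
}
```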

By simply switching the generator for our protobuf files from Google's to the gogo fork, we can claw back a significant amount of performance when (de)serializing Protobuf messages from/to the wire in GRPC, without changing the actual shape of the messages on the wire at all (i.e. this is a fully transparent and backwards compatible change).

Now, relevant questions:

If this is a magical performance improvement, why isn't everybody using it?

Well, they kinda are. It turns out that Vitess is the only major Go distributed system still using the default Google Protobuf generator. Everybody else (Kubernetes, etcd, Prometheus, etc.) switched to it a while ago. There is, however, a new development in the Go/Protobuf/GRPC ecosystem that is worth taking into account: last year, Google released a new Protobuf package for Go with a new, non-backwards compatible API. I have, huh, strong feelings about the design of the package (which I may summarize as "it is a bit of a tire fire"), and it seems like the community consensus is similar, because no major software project has switched to the new package's APIs. The new Protobuf v2 package is fully identical on the wire, but it changes the Go APIs for (de)serialization, particularly by allowing many dynamic reflection features for PB messages that were hard to do with the old API. Obviously, Vitess does not do any reflection (or support dynamic protobuf messages at all), and neither do most of the major Go projects using ProtoBuf right now, which probably explains the glacially slow (nonexistent?) adoption of the new package.

Most relevantly to this discussion: Go's Protobuf V2 package has enough breaking changes that the Gogo Protobuf generator doesn't look like it will support it anytime soon. Hence, if we switch to the new generator (something which, as you can see in this PR's code changes, is rather trivial) and later decide to upgrade to APIv2 of the Protobuf package, we'd essentially lose all the performance improvements we're gaining here.

Are we likely to switch to Protobuf's APIv2 in the future?

I don't think so? As explained earlier, APIv1 and APIv2 of the Protobuf packages are identical on the wire. All that has changed are the actual APIs on the Go side of things, to give better support for dynamic message reflection. As explained by Google in their post, "We intend to maintain support for APIv1 indefinitely". This leads me to believe that this is a worthy performance improvement, which gives significant results with little effort, and which can be rolled back trivially in the future if we decide to upgrade to APIv2. Heck, we could even fix the gogo generator to support APIv2 -- it's not a trivial project, but it's something I could tackle if the need arises.

So how fast is the improvement in (de)serialization anyway?

I'm glad you asked! To answer this question, and to ensure that the complexity of the change is worth it from a performance point of view, I designed a synthetic benchmark, also included in this PR. The benchmark exercises the ExecuteBatch API, which is a hot contention point between the vtgates and the vttablets -- most notably, it is the contention point with the largest and most complex ProtoBuf messages being sent back and forth. To measure the overhead of serialization, we're running a local GRPC server and client (so all the queries are actually being serialized, sent over the wire, and deserialized -- same for the responses), although the server does not perform any operations on the request; it simply returns a dummy response. I believe this is a good way to measure the performance of the whole (de)serialization pipeline in the GRPC client and server.
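A harness like the one described could be structured along these lines; this is an illustrative sketch (names and buffer sizes are made up, not the PR's actual benchmark file), using gRPC's in-memory bufconn listener so the messages still go through the full encode/decode pipeline without touching a real network:

```go
package grpcbench

import (
	"context"
	"net"
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/test/bufconn"
)

// benchmarkRoundTrip spins up an in-memory gRPC server, registers a stub
// service that returns a canned response, and measures full request/response
// round trips (serialize, frame, deserialize) in the benchmark loop.
func benchmarkRoundTrip(b *testing.B, register func(*grpc.Server), call func(*grpc.ClientConn) error) {
	lis := bufconn.Listen(1 << 20)
	srv := grpc.NewServer()
	register(srv) // e.g. register a stub ExecuteBatch handler
	go srv.Serve(lis)
	defer srv.Stop()

	conn, err := grpc.Dial("bufnet",
		grpc.WithInsecure(),
		grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
			return lis.DialContext(ctx)
		}),
	)
	if err != nil {
		b.Fatal(err)
	}
	defer conn.Close()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if err := call(conn); err != nil { // e.g. client.ExecuteBatch(ctx, req)
			b.Fatal(err)
		}
	}
}
```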

As for the request and response payloads: they're generated randomly for a given array of target sizes (from 8 to 32k). These sizes are not measuring the space in bytes of the ProtoBuf messages, but rather their complexity. Higher numbers mean larger and more complex messages, with more recursive data structures in their fields.
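For illustration, a generator along these lines produces that kind of "complexity-scaled" payload. The field names assume the BoundQuery/BindVariable messages from Vitess's query.proto, and the scaling factors are made up rather than the ones used by the PR's benchmark:

```go
package grpcbench

import (
	"fmt"
	"math/rand"

	querypb "vitess.io/vitess/go/vt/proto/query"
)

// genQueries returns a batch whose fan-out and nesting grow with size:
// more queries, more bind variables per query, and longer values. The
// resulting protobuf payload is larger and more deeply structured, not
// simply "size" bytes long.
func genQueries(rng *rand.Rand, size int) []*querypb.BoundQuery {
	n := size/64 + 1 // number of queries in the batch grows with the target size
	queries := make([]*querypb.BoundQuery, n)
	for i := range queries {
		bindVars := make(map[string]*querypb.BindVariable)
		for j := 0; j < size/16+1; j++ {
			val := make([]byte, rng.Intn(size/8+1)+1)
			rng.Read(val)
			bindVars[fmt.Sprintf("v%d", j)] = &querypb.BindVariable{
				Type:  querypb.Type_VARBINARY,
				Value: val,
			}
		}
		queries[i] = &querypb.BoundQuery{
			Sql:           fmt.Sprintf("select * from t%d where id = :v0", i),
			BindVariables: bindVars,
		}
	}
	return queries
}
```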

Here are the graphs: we're looking at performance improvement for increasing request complexity (and a fixed response with a small size), and performance improvement for increasing response complexity (and a fixed request with a small size). Lastly, we'll take a look at the aggregate data. For all graphs, the Y axis is response time in ns per request, so smaller is better.

[graph: protobuf_req — time per request vs. increasing request complexity]
[graph: protobuf_resp — time per request vs. increasing response complexity]
[graph: protobuf_agg — aggregate comparison across all sizes]

We can see that we're benchmarking 3 different implementations here:

  • protoc-gen-go is the default implementation that Vitess is currently using. It generates code that (de)serializes protobuf messages using Go's reflect package at runtime.
  • protoc-gen-gofast is the backwards-compatible generator from gogo/protobuf. The generated code is fully unrolled so as not to use any reflection, but the generated Go code (and structs) remains fully backwards compatible with the github.com/golang/protobuf package APIs.
  • protoc-gen-gogofaster is the non-backwards compatible generator from gogo/protobuf. The generated (de)serialization code looks like the one from protoc-gen-gofast, but the resulting Go data structures are not compatible with github.com/golang/protobuf: instead, they depend on the github.com/gogo/protobuf package, allowing optimizations at the Go API level that are not possible otherwise (a rough sketch of the difference in generated struct shapes follows below). Again, to reiterate, the serialized Protobuf output of these three generators is identical; this only changes the Go code that serializes the messages.
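Here is the sketch mentioned above: roughly what the same (abridged) message looks like under the compatible and non-compatible generators. The exact struct tags and bookkeeping fields vary by generator version; this is illustrative, not copied from the generated output:

```go
package sketch

// Abridged query.Target as emitted by protoc-gen-go / protoc-gen-gofast:
// the APIv1 bookkeeping fields are kept for golang/protobuf compatibility.
type TargetCompat struct {
	Keyspace             string   `protobuf:"bytes,1,opt,name=keyspace,proto3" json:"keyspace,omitempty"`
	Shard                string   `protobuf:"bytes,2,opt,name=shard,proto3" json:"shard,omitempty"`
	XXX_NoUnkeyedLiteral struct{} `json:"-"`
	XXX_unrecognized     []byte   `json:"-"` // storage for unknown wire fields
	XXX_sizecache        int32    `json:"-"`
}

// The same message as emitted by protoc-gen-gogofaster: the bookkeeping
// fields are gone, so the struct is smaller and produces less GC pressure.
type TargetFaster struct {
	Keyspace string `protobuf:"bytes,1,opt,name=keyspace,proto3" json:"keyspace,omitempty"`
	Shard    string `protobuf:"bytes,2,opt,name=shard,proto3" json:"shard,omitempty"`
}
```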

What can we gather from these comparisons? Well, the performance improvement is there and it is statistically significant. We can dig further with a table (this time comparing only the existing implementation against gogofaster, i.e. the fastest codegen available to us):

Statistical Performance Improvement

| benchmark | old time/op (ns) | old ± | new time/op (ns) | new ± | delta | significance |
|---|---|---|---|---|---|---|
| Req8-Resp8 | 14875.9 | 7% | 13924.9 | 1% | -6.39% | p=0.000, n=8+8 |
| Req8-Resp64 | 16413.4 | 5% | 16366.1 | 1% | ~ | p=0.959, n=8+8 |
| Req8-Resp512 | 55703.7 | 1% | 47477 | 1% | -14.77% | p=0.001, n=6+8 |
| Req8-Resp4096 | 159331 | 3% | 152462 | 1% | -4.31% | p=0.001, n=7+7 |
| Req8-Resp32768 | 1.1631e+06 | 3% | 1.00909e+06 | 3% | -13.24% | p=0.000, n=7+8 |
| Req64-Resp8 | 16763.7 | 7% | 16264 | 1% | -2.98% | p=0.026, n=7+7 |
| Req64-Resp64 | 19467 | 7% | 18176.4 | 2% | -6.63% | p=0.007, n=8+8 |
| Req64-Resp512 | 60196.5 | 6% | 50280 | 2% | -16.47% | p=0.000, n=8+8 |
| Req64-Resp4096 | 165405 | 3% | 156575 | 2% | -5.34% | p=0.000, n=8+8 |
| Req64-Resp32768 | 1.17851e+06 | 7% | 1.02014e+06 | 5% | -13.44% | p=0.000, n=8+8 |
| Req512-Resp8 | 59465.8 | 6% | 42127.8 | 1% | -29.16% | p=0.000, n=8+8 |
| Req512-Resp64 | 61014.5 | 1% | 45093.2 | 1% | -26.09% | p=0.001, n=6+8 |
| Req512-Resp512 | 102391 | 9% | 76999.6 | 2% | -24.80% | p=0.000, n=8+8 |
| Req512-Resp4096 | 202812 | 2% | 182690 | 3% | -9.92% | p=0.001, n=6+8 |
| Req512-Resp32768 | 1.1783e+06 | 3% | 1.03586e+06 | 3% | -12.09% | p=0.000, n=8+8 |
| Req4096-Resp8 | 334235 | 5% | 218554 | 1% | -34.61% | p=0.000, n=8+7 |
| Req4096-Resp64 | 338987 | 7% | 218310 | 2% | -35.60% | p=0.000, n=8+8 |
| Req4096-Resp512 | 367464 | 10% | 242191 | 2% | -34.09% | p=0.000, n=8+8 |
| Req4096-Resp4096 | 445590 | 5% | 325993 | 2% | -26.84% | p=0.000, n=8+8 |
| Req4096-Resp32768 | 1.38854e+06 | 2% | 1.19323e+06 | 6% | -14.07% | p=0.000, n=7+8 |
| Req32768-Resp8 | 2.52635e+06 | 3% | 1.98634e+06 | 3% | -21.37% | p=0.001, n=7+7 |
| Req32768-Resp64 | 2.5925e+06 | 6% | 1.97678e+06 | 6% | -23.75% | p=0.000, n=8+8 |
| Req32768-Resp512 | 2.62975e+06 | 3% | 1.97621e+06 | 6% | -24.85% | p=0.001, n=6+8 |
| Req32768-Resp4096 | 2.71453e+06 | 5% | 2.08956e+06 | 9% | -23.02% | p=0.000, n=8+8 |
| Req32768-Resp32768 | 3.68806e+06 | 6% | 2.91223e+06 | 5% | -21.04% | p=0.000, n=8+8 |

The pattern is clear: the larger the messages, the more significant the performance improvement of the generator, with the impact being more noticeable in the request path (because requests for this API are more complex than responses, even though responses are usually larger). We're seeing up to 35% improvement in medium-sized requests. This is obviously not the actual improvement of the total request time in a production system (as, again, this benchmark does not perform computations on the server), but if we conservatively assume that total RPC overhead is around 10% of the response time for production requests (a conservative estimate, it's probably larger in complex requests), we're still looking at a very significant and realistic 2-3.5% improvement all across the board in a change that is fully backwards compatible.
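(Spelling out the arithmetic behind that estimate: if (de)serialization accounts for roughly 10% of end-to-end request time, then a 20-35% speedup of that slice works out to about 0.10 × 0.20 = 2% up to 0.10 × 0.35 = 3.5% of total request time.)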

Also worth noting: this is strictly a throughput benchmark; it does not take into account the reduced GC garbage generated by the new serializer and the new Go structs. Most notably, the extra metadata fields for unrecognized fields, which were not used at all by Vitess, have been stripped, resulting in less GC churn and smaller in-memory messages (very convenient for us, since we store quite a few ProtoBuf messages in our plan caches). Similarly, the raw CPU cost of (de)serialization has been significantly reduced, increasing the total throughput a given system can sustain before it reaches capacity and reducing provisioning costs.

Overall, I can't think of many other optimizations that would yield this kind of result for the overall performance of the distributed system without any backwards incompatible changes -- and, interestingly, without any new external dependencies, as the gogo/protobuf package is already used by Vitess as part of the etcd and kubernetes clients in the codebase.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

To emphasize this for the Nth time: the changes to serialization happen at the Go level and should be fully backwards compatible on the wire with older Vitess versions.

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

@derekperkins
Member

derekperkins commented Feb 19, 2021

I'm not sure that there's any performance gain worth switching to gogo. All of the choices by major projects to use it were made prior to the new API, and they haven't changed yet simply because of the work required to change.

I don't have a personal opinion about the new APIs, though you're not the first person I've talked to with a similar opinion. I just feel like this is backing ourselves into a corner that we are going to seriously regret later.

@rafael
Member

rafael commented Feb 19, 2021

> The pattern is clear: the larger the messages, the more significant the performance improvement of the generator, with the impact being more noticeable in the request path (because requests for this API are more complex than responses, even though responses are usually larger). We're seeing up to 35% improvement in medium-sized requests. This is obviously not the actual improvement of the total request time in a production system (as, again, this benchmark does not perform computations on the server), but if we conservatively assume that total RPC overhead is around 10% of the response time for production requests (a conservative estimate, it's probably larger in complex requests), we're still looking at a very significant and realistic 2-3.5% improvement all across the board in a change that is fully backwards compatible.

This is really interesting and promising. We have been chasing problems with large messages for years, but we continue to run into various issues. I think improvements like this are definitely steps in the right direction.

This is really good!

@derekperkins
Member

> We intend to maintain support for APIv1 indefinitely

I feel like I've seen enough Google deprecations to interpret this as, "We aren't working on this anymore except for CVEs, and it won't get any feature or performance improvements from v2. Once we have deemed that a critical mass has migrated, or there's a significant enough blocker, we will stop supporting it."

@vmg
Collaborator Author

vmg commented Feb 22, 2021

> I'm not sure that there's any performance gain worth switching to gogo.

The unfortunate truth is that we're not going to find another 3% global throughput increase in Vitess without a massive engineering effort. That engineering effort could be scoped to other parts of the codebase, or it could be scoped to ensuring Gogo (or an equivalent) is properly maintained in the long term. The metrics show that our messages in production are actually quite large and complex, and they need to be optimized because they're showing up in the graphs.

How do we accomplish that in the most sustainable way? Well, my recommendation for this PR is to use the backwards-compatible Gogo generator, which doesn't introduce any actual dependencies on the unmaintained Gogo packages (even though, unfortunately, it's not the fastest option). I think it's a good middle ground that gets us unrolled, faster code while still depending only on Google's APIs.

Once we have removed this low-hanging fruit, I intend to research how hard it'd be to wire up the generation to the new ProtoBuf API v2. It appears that the gogo project provides a lot of functionality which we don't intend to use and which would be very hard to port to the new API v2, but if we could depend on our own generator, implemented as a plug-in to Google's protoc-gen-go (as opposed to gogo's approach of replacing the generator altogether), that seems like a good middle ground between performance and maintainability -- again, always judging by the graphs shown in this PR.
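To give an idea of the shape of that "own generator" option: Google's APIv2 module ships the google.golang.org/protobuf/compiler/protogen scaffolding for writing protoc plugins, so a skeleton could look like the sketch below. The plugin name, output suffix, and emitted contents are hypothetical; the real work would be generating the unrolled marshal/unmarshal/size methods.

```go
package main

import (
	"google.golang.org/protobuf/compiler/protogen"
)

func main() {
	protogen.Options{}.Run(func(plugin *protogen.Plugin) error {
		for _, file := range plugin.Files {
			if !file.Generate {
				continue
			}
			// Emit a companion file next to the stock protoc-gen-go output.
			g := plugin.NewGeneratedFile(file.GeneratedFilenamePrefix+"_fastmarshal.pb.go", file.GoImportPath)
			g.P("// Code generated by a hypothetical fast-marshal plugin. DO NOT EDIT.")
			g.P("package ", file.GoPackageName)
			for _, msg := range file.Messages {
				// A real implementation would write unrolled
				// Marshal/Unmarshal/Size methods for msg here, field by field.
				g.P("// TODO: fast (un)marshalers for ", msg.GoIdent.GoName)
			}
		}
		return nil
	})
}
```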

@sougou
Contributor

sougou commented Feb 22, 2021

@vmg This is awesome research! 👍

Someone also told me that the http2 transport layer under grpc could be a CPU hog. It's another thing to look at, at least to confirm/deny the theory.

@derekperkins
Member

I don't want to come across as too negative about the potential performance gains; I think this is awesome. I just want to make sure we're going in with the understanding that gogoproto has been without maintainers for a year, and while this PR might be scoped cleanly, it's all too easy for future code to depend on specific gogo behavior that would be near impossible to back out.

> if we could depend on our own generator, which would be implemented as a plug-in to Google's protoc-gen-go, as opposed to gogo's approach of replacing the generator altogether, that seems like a good middle ground between performance and maintainability

This sounds like a good path forward. Given that we only need a subset of functionality and that the transport consumes so many resources, it seems reasonable to take control of that piece, and as you mentioned at the beginning, we could even explore non-protobuf protocols if data showed it was worth the effort.

@vmg
Collaborator Author

vmg commented Feb 23, 2021

👍 I definitely agree that depending on the gogo/protobuf package directly is too risky. I've refactored the PR so we only depend on the codegen and we remain fully backwards compatible. It's ready for review now!

Contributor

@shlomi-noach shlomi-noach left a comment

The discussion around which library we should use is really good. My take is to approve based on @vmg's work and the understanding that we can roll back. I wonder, is there a way to lint the code so that we ensure nothing uses the "extra" features of gogo? That would ensure we can always roll back.

@vmg
Collaborator Author

vmg commented Feb 24, 2021

@shlomi-noach it's already linted! I've forced the usage of the backwards-compatible generator, so any gogo-exclusive features that are introduced will just crash the build. 👌
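(Purely as an illustration of what such a guard can look like -- not necessarily the exact check wired into this PR's build -- a small test can scan the .proto sources and fail if any gogo-specific option sneaks in:)

```go
package prototest

import (
	"os"
	"path/filepath"
	"strings"
	"testing"
)

// TestNoGogoExtensions fails the build if any .proto file starts relying on
// gogoproto options, which would break the ability to fall back to the stock
// protoc-gen-go generator. Illustrative only; the path is an assumption.
func TestNoGogoExtensions(t *testing.T) {
	root := "../../proto" // hypothetical location of the .proto sources
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".proto") {
			return err
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		if strings.Contains(string(data), "gogoproto") {
			t.Errorf("%s uses gogoproto extensions; only plain generator features are allowed", path)
		}
		return nil
	})
	if err != nil {
		t.Fatal(err)
	}
}
```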

@shlomi-noach
Contributor

> it's already linted!

Whoa!

Collaborator

@systay systay left a comment

LGTM.

The red tests seem valid and not just flaky, right?

@vmg
Collaborator Author

vmg commented Feb 24, 2021

@systay ayup I screwed up when merging master. Should be green now and ready to merge.

@shlomi-noach shlomi-noach merged commit 5da35cd into vitessio:master Feb 24, 2021
@ajm188
Contributor

ajm188 commented Feb 24, 2021

@vmg I'm getting the following when running make proto after this merge:

```
vtctldata.proto:167:3: "google.protobuf.Duration" is not defined.
vtctldata.proto:334:3: "google.protobuf.Duration" is not defined.
```

Do you happen to know a workaround for this?

Edit: I saw the changes to protoutil, and I had another branch that introduced new occurrences of google.protobuf.Duration to the vtctldata, which I updated to use the vttime type you added, and things are working great. Sorry for the false alarm!

@vmg
Collaborator Author

vmg commented Feb 25, 2021

Yes! You got it! We need to use our own vttime.Duration to ensure we can generate the optimized code for it. It's fully compatible with the default Duration in ProtoBufs.
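For anyone hitting the same thing, converting between the standard library's time.Duration and the vttime message is only a few lines. The sketch below assumes vttime.Duration mirrors google.protobuf.Duration's Seconds/Nanos fields, and the helper names are illustrative (the real helpers live in the protoutil package mentioned above, possibly with different signatures):

```go
package main

import (
	"fmt"
	"time"

	vttimepb "vitess.io/vitess/go/vt/proto/vttime"
)

// durationToProto converts a time.Duration into the vttime.Duration message.
func durationToProto(d time.Duration) *vttimepb.Duration {
	return &vttimepb.Duration{
		Seconds: int64(d / time.Second),
		Nanos:   int32(d % time.Second),
	}
}

// durationFromProto converts back, treating a nil message as zero.
func durationFromProto(d *vttimepb.Duration) time.Duration {
	if d == nil {
		return 0
	}
	return time.Duration(d.Seconds)*time.Second + time.Duration(d.Nanos)*time.Nanosecond
}

func main() {
	d := 90 * time.Second
	fmt.Println(durationFromProto(durationToProto(d)) == d) // true
}
```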
