
perf: RPC Serialization #7519

Merged
merged 8 commits into vitessio:master on Feb 24, 2021

Conversation

vmg
Collaborator

@vmg vmg commented Feb 19, 2021

Description

Hellooooo computer fiends. Coming up next in my series of performance improvements, we're tracking RPC performance, starting with the lowest hanging fruit: serialization. Right now all the RPC communication for Vitess happens through GRPC (between vtgate and the tablets, for the admin API, etc), and GRPC kind of forces us towards using Protocol Buffers for (de)serialization. This is actually not a hard requirement on the GRPC API, as it supports custom serializers, but this is not a particularly well tested interface, so in the spirit of incremental, non-breaking changes, let us not go down that road (at least for the time being).
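(For reference, this is roughly what gRPC-Go's pluggable serializer hook looks like -- a minimal sketch using the real google.golang.org/grpc/encoding package with a hypothetical pass-through codec, shown only to illustrate why swapping the serializer is possible but not a road we want to take right now.)

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/encoding"
	// The default proto codec is registered by this package's init().
	_ "google.golang.org/grpc/encoding/proto"
)

// passthroughCodec is a hypothetical codec that simply delegates to the
// registered proto codec; a real custom serializer would implement
// Marshal/Unmarshal itself.
type passthroughCodec struct {
	inner encoding.Codec
}

func (c passthroughCodec) Marshal(v interface{}) ([]byte, error) { return c.inner.Marshal(v) }

func (c passthroughCodec) Unmarshal(data []byte, v interface{}) error {
	return c.inner.Unmarshal(data, v)
}

// Name must match the content-subtype negotiated by client and server;
// reusing "proto" keeps the change transparent on the wire.
func (c passthroughCodec) Name() string { return "proto" }

func main() {
	base := encoding.GetCodec("proto")
	encoding.RegisterCodec(passthroughCodec{inner: base})
	fmt.Println("registered codec:", encoding.GetCodec("proto").Name())
}
```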

So, since we're stuck with Protobuf, what options do we have to improve its performance? The most obvious answer is gogo/protobuf. For those who've never heard of the Gogo generator, it's an alternative implementation of the protoc-gen-go plugin that ships with Google's golang/protobuf package. The new generator supports many new features, the most relevant (for us) being the full unrolling of the generated serialization and deserialization code for all our Protobuf messages. The default implementation of protoc-gen-go generates code that uses reflection at runtime to (de)serialize messages, which adds a very significant performance overhead.
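To make the reflection point concrete, here is a simplified sketch (not the exact gRPC or protobuf-library internals) of how a serializer can prefer the unrolled methods that the gogo generators emit on every message, falling back to the reflection-based proto.Marshal otherwise:

```go
package main

import (
	"fmt"

	"github.com/golang/protobuf/proto"
)

// fastMarshaler is the shape of the unrolled, reflection-free Marshal method
// that the gogo generators emit on every message type.
type fastMarshaler interface {
	Marshal() ([]byte, error)
}

// marshal prefers the generated fast path when a message provides one, and
// falls back to the reflection-based proto.Marshal otherwise.
func marshal(msg proto.Message) ([]byte, error) {
	if fm, ok := msg.(fastMarshaler); ok {
		return fm.Marshal() // unrolled code, generated per message type
	}
	return proto.Marshal(msg) // walks the message via reflection at runtime
}

// fakeMsg stands in for a generated message so the example is self-contained.
type fakeMsg struct{ payload []byte }

func (m *fakeMsg) Reset()                   {}
func (m *fakeMsg) String() string           { return "fakeMsg" }
func (m *fakeMsg) ProtoMessage()            {}
func (m *fakeMsg) Marshal() ([]byte, error) { return m.payload, nil }

func main() {
	out, _ := marshal(&fakeMsg{payload: []byte{0x0a, 0x03, 'f', 'o', 'o'}})
	fmt.Printf("%d bytes via the fast path\n", len(out))
}
```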

By simply switching the generator for our protobuf files from Google's to the gogo fork, we can claw back a significant amount of performance when (de)serializing Protobuf messages from/to the wire in GRPC, without changing the actual shape of the messages on the wire at all (i.e. this is a fully transparent and backwards compatible change).

Now, relevant questions:

If this is a magical performance improvement, why isn't everybody using it?

Well, they kinda are. It turns out that Vitess is the only major Go distributed system still using the default Google Protobuf generator. Everybody else (Kubernetes, etcd, Prometheus, etc.) switched to it a while ago. There is, however, a new development in the Go/Protobuf/GRPC ecosystem that is worth taking into account: last year, Google released a new Protobuf package for Go with a new, non-backwards compatible API. I have, huh, strong feelings about the design of the package (which I may summarize as "it is a bit of a tire fire"), and it seems like the community consensus is similar, because no major software project has switched to the new package's APIs. The new Protobuf v2 package is fully identical on the wire, but it changes the Go APIs for (de)serialization, particularly by allowing many dynamic reflection features for PB messages that were hard to do with the old API. Obviously, Vitess does not do any reflection (or support dynamic protobuf messages at all), and neither do most of the major Go projects using ProtoBuf right now, which probably explains the glacially slow (nonexistent?) adoption of the new package.

Most relevantly to this discussion: Go's Protobuf V2 package has enough breaking changes that the Gogo Protobuf generator doesn't look like it will support it anytime soon. Hence, if we switch to the new generator (something which, as you can see in this PR's code changes, is rather trivial) and later decide to upgrade to APIv2 of the Protobuf package, we'd essentially lose all the performance improvements we're gaining here.

Are we likely to switch to Protobuf's APIv2 in the future?

I don't think so? As explained earlier, APIv1 and APIv2 of the Protobuf packages are identical on the wire. All that has changed are the actual APIs on the Go side of things, to give better support for dynamic message reflection. As explained by Google in their post, "We intend to maintain support for APIv1 indefinitely". This leads me to believe that this is a worthy performance improvement, which gives significant results with little effort, and which can be rolled back trivially in the future if we decide to upgrade to APIv2. Heck, we could even fix the gogo generator to support APIv2 -- it's not a trivial project, but it's something I could tackle if the need arises.

So how fast is the improvement in (de)serialization anyway?

I'm glad you asked! To answer this question, and to ensure that the complexity of the change is worth it from a performance point of view, I designed a synthetic benchmark, also included in this PR. The benchmark exercises the ExecuteBatch API, which is a hot contention point between the vtgates and the vttablets -- most notably, it is the contention point with the largest and most complex ProtoBuf messages being sent back and forth. To measure the overhead of serialization, we're running a local GRPC server and client (so all the queries are actually being serialized, sent over the wire, and deserialized -- same for the responses), although the server does not perform any operations on the request; it simply returns a dummy response. I believe this is a good way to measure the performance of the whole (de)serialization pipeline in the GRPC client and server.
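A harness like the one described could be structured along these lines; this is an illustrative sketch (names and buffer sizes are made up, not the PR's actual benchmark file), using gRPC's in-memory bufconn listener so the messages still go through the full encode/decode pipeline without touching a real network:

```go
package grpcbench

import (
	"context"
	"net"
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/test/bufconn"
)

// benchmarkRoundTrip spins up an in-memory gRPC server, registers a stub
// service that returns a canned response, and measures full request/response
// round trips (serialize, frame, deserialize) in the benchmark loop.
func benchmarkRoundTrip(b *testing.B, register func(*grpc.Server), call func(*grpc.ClientConn) error) {
	lis := bufconn.Listen(1 << 20)
	srv := grpc.NewServer()
	register(srv) // e.g. register a stub ExecuteBatch handler
	go srv.Serve(lis)
	defer srv.Stop()

	conn, err := grpc.Dial("bufnet",
		grpc.WithInsecure(),
		grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
			return lis.DialContext(ctx)
		}),
	)
	if err != nil {
		b.Fatal(err)
	}
	defer conn.Close()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if err := call(conn); err != nil { // e.g. client.ExecuteBatch(ctx, req)
			b.Fatal(err)
		}
	}
}
```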

As for the request and response payloads: they're generated randomly for a given array of target sizes (from 8 to 32k). These sizes are not measuring the space in bytes of the ProtoBuf messages, but rather their complexity. Higher numbers mean larger and more complex messages, with more recursive data structures in their fields.
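For illustration, a generator along these lines produces that kind of "complexity-scaled" payload. The field names assume the BoundQuery/BindVariable messages from Vitess's query.proto, and the scaling factors are made up rather than the ones used by the PR's benchmark:

```go
package grpcbench

import (
	"fmt"
	"math/rand"

	querypb "vitess.io/vitess/go/vt/proto/query"
)

// genQueries returns a batch whose fan-out and nesting grow with size:
// more queries, more bind variables per query, and longer values. The
// resulting protobuf payload is larger and more deeply structured, not
// simply "size" bytes long.
func genQueries(rng *rand.Rand, size int) []*querypb.BoundQuery {
	n := size/64 + 1 // number of queries in the batch grows with the target size
	queries := make([]*querypb.BoundQuery, n)
	for i := range queries {
		bindVars := make(map[string]*querypb.BindVariable)
		for j := 0; j < size/16+1; j++ {
			val := make([]byte, rng.Intn(size/8+1)+1)
			rng.Read(val)
			bindVars[fmt.Sprintf("v%d", j)] = &querypb.BindVariable{
				Type:  querypb.Type_VARBINARY,
				Value: val,
			}
		}
		queries[i] = &querypb.BoundQuery{
			Sql:           fmt.Sprintf("select * from t%d where id = :v0", i),
			BindVariables: bindVars,
		}
	}
	return queries
}
```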

Here are the graphs: we're looking at performance improvement for increasing request complexity (and a fixed response with a small size), and performance improvement for increasing response complexity (and a fixed request with a small size). Lastly, we'll take a look at the aggregate data. For all graphs, the Y axis is response time in ns per request, so smaller is better.

[graph: protobuf_req — time per request vs. increasing request complexity]
[graph: protobuf_resp — time per request vs. increasing response complexity]
[graph: protobuf_agg — aggregate comparison across all sizes]

We can see that we're benchmarking 3 different implementations here:

  • protoc-gen-go is the default implementation that Vitess is currently using. It generates code that (de)serializes protobuf messages using Go's reflect package at runtime.
  • protoc-gen-gofast is the backwards-compatible generator from gogo/protobuf. The generated code is fully unrolled so as not to use any reflection, but the generated Go code (and structs) remains fully backwards compatible with the github.com/golang/protobuf package APIs.
  • protoc-gen-gogofaster is the non-backwards compatible generator from gogo/protobuf. The generated (de)serialization code looks like the one from protoc-gen-gofast, but the resulting Go data structures are not compatible with github.com/golang/protobuf: instead, they depend on the github.com/gogo/protobuf package, allowing optimizations at the Go API level that are not possible otherwise (a rough sketch of the difference in generated struct shapes follows below). Again, to reiterate, the serialized Protobuf output of these three generators is identical; this only changes the Go code that serializes the messages.
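Here is the sketch mentioned above: roughly what the same (abridged) message looks like under the compatible and non-compatible generators. The exact struct tags and bookkeeping fields vary by generator version; this is illustrative, not copied from the generated output:

```go
package sketch

// Abridged query.Target as emitted by protoc-gen-go / protoc-gen-gofast:
// the APIv1 bookkeeping fields are kept for golang/protobuf compatibility.
type TargetCompat struct {
	Keyspace             string   `protobuf:"bytes,1,opt,name=keyspace,proto3" json:"keyspace,omitempty"`
	Shard                string   `protobuf:"bytes,2,opt,name=shard,proto3" json:"shard,omitempty"`
	XXX_NoUnkeyedLiteral struct{} `json:"-"`
	XXX_unrecognized     []byte   `json:"-"` // storage for unknown wire fields
	XXX_sizecache        int32    `json:"-"`
}

// The same message as emitted by protoc-gen-gogofaster: the bookkeeping
// fields are gone, so the struct is smaller and produces less GC pressure.
type TargetFaster struct {
	Keyspace string `protobuf:"bytes,1,opt,name=keyspace,proto3" json:"keyspace,omitempty"`
	Shard    string `protobuf:"bytes,2,opt,name=shard,proto3" json:"shard,omitempty"`
}
```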

What can we gather from these comparisons? Well, the performance improvement is there and it is statistically significant. We can dig further with a table (this time comparing only the existing implementation against gogofaster, i.e. the fastest codegen available to us):

Statistical Performance Improvement

| benchmark | old time/op (ns) | old ± | new time/op (ns) | new ± | delta | significance |
|---|---|---|---|---|---|---|
| Req8-Resp8 | 14875.9 | 7% | 13924.9 | 1% | -6.39% | p=0.000, n=8+8 |
| Req8-Resp64 | 16413.4 | 5% | 16366.1 | 1% | ~ | p=0.959, n=8+8 |
| Req8-Resp512 | 55703.7 | 1% | 47477 | 1% | -14.77% | p=0.001, n=6+8 |
| Req8-Resp4096 | 159331 | 3% | 152462 | 1% | -4.31% | p=0.001, n=7+7 |
| Req8-Resp32768 | 1.1631e+06 | 3% | 1.00909e+06 | 3% | -13.24% | p=0.000, n=7+8 |
| Req64-Resp8 | 16763.7 | 7% | 16264 | 1% | -2.98% | p=0.026, n=7+7 |
| Req64-Resp64 | 19467 | 7% | 18176.4 | 2% | -6.63% | p=0.007, n=8+8 |
| Req64-Resp512 | 60196.5 | 6% | 50280 | 2% | -16.47% | p=0.000, n=8+8 |
| Req64-Resp4096 | 165405 | 3% | 156575 | 2% | -5.34% | p=0.000, n=8+8 |
| Req64-Resp32768 | 1.17851e+06 | 7% | 1.02014e+06 | 5% | -13.44% | p=0.000, n=8+8 |
| Req512-Resp8 | 59465.8 | 6% | 42127.8 | 1% | -29.16% | p=0.000, n=8+8 |
| Req512-Resp64 | 61014.5 | 1% | 45093.2 | 1% | -26.09% | p=0.001, n=6+8 |
| Req512-Resp512 | 102391 | 9% | 76999.6 | 2% | -24.80% | p=0.000, n=8+8 |
| Req512-Resp4096 | 202812 | 2% | 182690 | 3% | -9.92% | p=0.001, n=6+8 |
| Req512-Resp32768 | 1.1783e+06 | 3% | 1.03586e+06 | 3% | -12.09% | p=0.000, n=8+8 |
| Req4096-Resp8 | 334235 | 5% | 218554 | 1% | -34.61% | p=0.000, n=8+7 |
| Req4096-Resp64 | 338987 | 7% | 218310 | 2% | -35.60% | p=0.000, n=8+8 |
| Req4096-Resp512 | 367464 | 10% | 242191 | 2% | -34.09% | p=0.000, n=8+8 |
| Req4096-Resp4096 | 445590 | 5% | 325993 | 2% | -26.84% | p=0.000, n=8+8 |
| Req4096-Resp32768 | 1.38854e+06 | 2% | 1.19323e+06 | 6% | -14.07% | p=0.000, n=7+8 |
| Req32768-Resp8 | 2.52635e+06 | 3% | 1.98634e+06 | 3% | -21.37% | p=0.001, n=7+7 |
| Req32768-Resp64 | 2.5925e+06 | 6% | 1.97678e+06 | 6% | -23.75% | p=0.000, n=8+8 |
| Req32768-Resp512 | 2.62975e+06 | 3% | 1.97621e+06 | 6% | -24.85% | p=0.001, n=6+8 |
| Req32768-Resp4096 | 2.71453e+06 | 5% | 2.08956e+06 | 9% | -23.02% | p=0.000, n=8+8 |
| Req32768-Resp32768 | 3.68806e+06 | 6% | 2.91223e+06 | 5% | -21.04% | p=0.000, n=8+8 |

The pattern is clear: the larger the messages, the more significant the performance improvement of the generator, with the impact being more noticeable in the request path (because requests for this API are more complex than responses, even though responses are usually larger). We're seeing up to 35% improvement in medium-sized requests. This is obviously not the actual improvement of the total request time in a production system (as, again, this benchmark does not perform computations on the server), but if we conservatively assume that total RPC overhead is around 10% of the response time for production requests (a conservative estimate, it's probably larger in complex requests), we're still looking at a very significant and realistic 2-3.5% improvement all across the board in a change that is fully backwards compatible.
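(Spelling out the arithmetic behind that estimate: if (de)serialization accounts for roughly 10% of end-to-end request time, then a 20-35% speedup of that slice works out to about 0.10 × 0.20 = 2% up to 0.10 × 0.35 = 3.5% of total request time.)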

Also worth noting: this is strictly a throughput benchmark; it does not take into account the reduced GC garbage generated by the new serializer and the new Go structs. Most notably, the extra metadata fields for unrecognized fields, which were not used at all by Vitess, have been stripped, resulting in less GC churn and smaller in-memory messages (very convenient for us, since we store quite a few ProtoBuf messages in our plan caches). Similarly, the raw CPU cost of (de)serialization has been significantly reduced, increasing the total throughput a given system can sustain before it reaches capacity and reducing provisioning costs.

Overall, I can't think of many other optimizations that would yield this kind of result for the overall performance of the distributed system without any backwards incompatible changes -- and, interestingly, without any new external dependencies, as the gogo/protobuf package is already used by Vitess as part of the etcd and kubernetes clients in the codebase.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

To emphasize this for the Nth time: the changes to serialization happen at the Go level and should be fully backwards compatible on the wire with older Vitess versions.

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

@derekperkins
Member

derekperkins commented Feb 19, 2021

I'm not sure that there's any performance gain worth switching to gogo. All of the choices by major projects to use it were made prior to the new API, and they haven't changed yet simply because of the work required to change.

I don't have a personal opinion about the new APIs, though you're not the first person I've talked to with a similar opinion. I just feel like this is backing ourselves into a corner that we are going to seriously regret later.

@rafael
Member

rafael commented Feb 19, 2021

> The pattern is clear: the larger the messages, the more significant the performance improvement of the generator, with the impact being more noticeable in the request path (because requests for this API are more complex than responses, even though responses are usually larger). We're seeing up to 35% improvement in medium-sized requests. This is obviously not the actual improvement of the total request time in a production system (as, again, this benchmark does not perform computations on the server), but if we conservatively assume that total RPC overhead is around 10% of the response time for production requests (a conservative estimate, it's probably larger in complex requests), we're still looking at a very significant and realistic 2-3.5% improvement all across the board in a change that is fully backwards compatible.

This is really interesting and promising. We have been chasing problems with large messages for years, but we continue to run into various issues. I think improvements like this are definitely steps in the right direction.

This is really good!

@derekperkins
Member

> We intend to maintain support for APIv1 indefinitely

I feel like I've seen enough Google deprecations to interpret this as, "We aren't working on this anymore except for CVEs, and it won't get any feature or performance improvements from v2. Once we have deemed that a critical mass has migrated, or there's a significant enough blocker, we will stop supporting it."

@vmg
Collaborator Author

vmg commented Feb 22, 2021

> I'm not sure that there's any performance gain worth switching to gogo.

The unfortunate truth is that we're not going to find another 3% global throughput increase in Vitess without a massive engineering effort. That engineering effort could be scoped to other parts of the codebase, or it could be scoped to ensuring Gogo (or an equivalent) is properly maintained in the long term. The metrics show that our messages in production are actually quite large and complex, and they need to be optimized because they're showing up in the graphs.

How do we accomplish that in the most sustainable way? Well, my recommendation for this PR is to use the backwards-compatible Gogo generator, which doesn't introduce any actual dependencies on the unmaintained Gogo packages (even though, unfortunately, it's not the fastest option). I think it's a good middle ground that gets us unrolled, faster code while still depending only on Google's APIs.

Once we have removed this low-hanging fruit, I intend to research how hard it'd be to wire up the generation to the new ProtoBuf API v2. It appears that the gogo project provides a lot of functionality which we don't intend to use and which would be very hard to port to the new API v2, but if we could depend on our own generator, implemented as a plug-in to Google's protoc-gen-go (as opposed to gogo's approach of replacing the generator altogether), that seems like a good middle ground between performance and maintainability -- again, always judging by the graphs shown in this PR.
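To give an idea of the shape of that "own generator" option: Google's APIv2 module ships the google.golang.org/protobuf/compiler/protogen scaffolding for writing protoc plugins, so a skeleton could look like the sketch below. The plugin name, output suffix, and emitted contents are hypothetical; the real work would be generating the unrolled marshal/unmarshal/size methods.

```go
package main

import (
	"google.golang.org/protobuf/compiler/protogen"
)

func main() {
	protogen.Options{}.Run(func(plugin *protogen.Plugin) error {
		for _, file := range plugin.Files {
			if !file.Generate {
				continue
			}
			// Emit a companion file next to the stock protoc-gen-go output.
			g := plugin.NewGeneratedFile(file.GeneratedFilenamePrefix+"_fastmarshal.pb.go", file.GoImportPath)
			g.P("// Code generated by a hypothetical fast-marshal plugin. DO NOT EDIT.")
			g.P("package ", file.GoPackageName)
			for _, msg := range file.Messages {
				// A real implementation would write unrolled
				// Marshal/Unmarshal/Size methods for msg here, field by field.
				g.P("// TODO: fast (un)marshalers for ", msg.GoIdent.GoName)
			}
		}
		return nil
	})
}
```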

@sougou
Contributor

sougou commented Feb 22, 2021

@vmg This is awesome research! 👍

Someone also told me that the http2 transport layer under grpc could be a CPU hog. It's another thing to look at, at least to confirm/deny the theory.

@derekperkins
Member

I don't want to come across as too negative about the potential performance gains; I think this is awesome. I just want to make sure we're going in with the understanding that gogoproto has been without maintainers for a year, and while this PR might be scoped cleanly, it's all too easy for future code to depend on specific gogo behavior that would be near impossible to back out.

> if we could depend on our own generator, which would be implemented as a plug-in to Google's protoc-gen-go, as opposed to gogo's approach of replacing the generator altogether, that seems like a good middle ground between performance and maintainability

This sounds like a good path forward. Given that we only need a subset of functionality and that the transport consumes so many resources, it seems reasonable to take control of that piece, and as you mentioned at the beginning, we could even explore non-protobuf protocols if data showed it was worth the effort.

@vmg
Collaborator Author

vmg commented Feb 23, 2021

👍 I definitely agree that depending on the gogo/protobuf package directly is too risky. I've refactored the PR so we only depend on the codegen and we remain fully backwards compatible. It's ready for review now!

Contributor

@shlomi-noach shlomi-noach left a comment

The discussion around which library we should use is really good. My take is to approve based on @vmg's work and the understanding that we can roll back. I wonder, is there a way to lint the code so that we ensure nothing uses the "extra" features of gogo? That would ensure we can always roll back.

@vmg
Collaborator Author

vmg commented Feb 24, 2021

@shlomi-noach it's already linted! I've forced the usage of the backwards-compatible generator, so any gogo-exclusive features that are introduced will just crash the build. 👌
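(Purely as an illustration of what such a guard can look like -- not necessarily the exact check wired into this PR's build -- a small test can scan the .proto sources and fail if any gogo-specific option sneaks in:)

```go
package prototest

import (
	"os"
	"path/filepath"
	"strings"
	"testing"
)

// TestNoGogoExtensions fails the build if any .proto file starts relying on
// gogoproto options, which would break the ability to fall back to the stock
// protoc-gen-go generator. Illustrative only; the path is an assumption.
func TestNoGogoExtensions(t *testing.T) {
	root := "../../proto" // hypothetical location of the .proto sources
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".proto") {
			return err
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		if strings.Contains(string(data), "gogoproto") {
			t.Errorf("%s uses gogoproto extensions; only plain generator features are allowed", path)
		}
		return nil
	})
	if err != nil {
		t.Fatal(err)
	}
}
```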

@shlomi-noach
Contributor

> it's already linted!

Whoa!

Collaborator

@systay systay left a comment

LGTM.

The red tests seem valid and not just flaky, right?

@vmg
Collaborator Author

vmg commented Feb 24, 2021

@systay ayup I screwed up when merging master. Should be green now and ready to merge.

@shlomi-noach shlomi-noach merged commit 5da35cd into vitessio:master Feb 24, 2021
@ajm188
Contributor

ajm188 commented Feb 24, 2021

@vmg I'm getting the following when running make proto after this merge:

```
vtctldata.proto:167:3: "google.protobuf.Duration" is not defined.
vtctldata.proto:334:3: "google.protobuf.Duration" is not defined.
```

Do you happen to know a workaround for this?

Edit: I saw the changes to protoutil, and I had another branch that introduced new occurrences of google.protobuf.Duration to the vtctldata, which I updated to use the vttime type you added, and things are working great. Sorry for the false alarm!

@vmg
Collaborator Author

vmg commented Feb 25, 2021

Yes! You got it! We need to use our own vttime.Duration to ensure we can generate the optimized code for it. It's fully compatible with the default Duration in ProtoBufs.
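For anyone hitting the same thing, converting between the standard library's time.Duration and the vttime message is only a few lines. The sketch below assumes vttime.Duration mirrors google.protobuf.Duration's Seconds/Nanos fields, and the helper names are illustrative (the real helpers live in the protoutil package mentioned above, possibly with different signatures):

```go
package main

import (
	"fmt"
	"time"

	vttimepb "vitess.io/vitess/go/vt/proto/vttime"
)

// durationToProto converts a time.Duration into the vttime.Duration message.
func durationToProto(d time.Duration) *vttimepb.Duration {
	return &vttimepb.Duration{
		Seconds: int64(d / time.Second),
		Nanos:   int32(d % time.Second),
	}
}

// durationFromProto converts back, treating a nil message as zero.
func durationFromProto(d *vttimepb.Duration) time.Duration {
	if d == nil {
		return 0
	}
	return time.Duration(d.Seconds)*time.Second + time.Duration(d.Nanos)*time.Nanosecond
}

func main() {
	d := 90 * time.Second
	fmt.Println(durationFromProto(durationToProto(d)) == d) // true
}
```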
