Add a client stream WriteSpan method in grpc storage plugin #3636

Closed · vuuihc opened this issue Apr 21, 2022 · 3 comments · Fixed by #3640
vuuihc (Contributor) commented Apr 21, 2022

Hi, thanks for this outstanding project. I have a performance improvement suggestion for the grpc storage plugin.

Requirement - what kind of business use case are you trying to solve?

When saving traces with the jaeger-clickhouse storage plugin, I found that the all-in-one process's CPU utilization is relatively high while the write throughput remains fairly low. Profiling showed that gRPC communication accounts for a relatively large share of the CPU time, and this can be optimized.

Problem - what in Jaeger blocks you from solving the requirement?

In the gRPC service definition of Jaeger's storage gRPC plugin, the WriteSpan method is a unary RPC method.

Proposal - what do you suggest to solve the problem or improve the existing situation?

Write to storage with a streaming RPC. Specifically, we could add a client-streaming method, WriteSpanStreamingly:

service SpanWriterPlugin {
    // spanstore/Writer
    rpc WriteSpan(WriteSpanRequest) returns (WriteSpanResponse);
    rpc WriteSpanStreamingly(stream WriteSpanRequest) returns (WriteSpanResponse); // add this method
    rpc Close(CloseWriterRequest) returns (CloseWriterResponse);
}
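
For illustration, here is a minimal sketch of how a plugin server might implement this method, assuming the proto above is compiled with the usual gRPC code generator; the generated type names, the grpcServer struct, and its writer field are assumptions, not existing Jaeger code.

```go
package plugin

import (
	"io"

	"github.com/jaegertracing/jaeger/proto-gen/storage_v1"
	"github.com/jaegertracing/jaeger/storage/spanstore"
)

// grpcServer stands in for the plugin's gRPC server; writer is the storage backend.
type grpcServer struct {
	writer spanstore.Writer
}

// WriteSpanStreamingly drains the client stream, writing each span to storage, and
// acknowledges the whole stream with a single response when the client closes it.
func (s *grpcServer) WriteSpanStreamingly(stream storage_v1.SpanWriterPlugin_WriteSpanStreaminglyServer) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			// The client finished sending; reply once for the entire stream.
			return stream.SendAndClose(&storage_v1.WriteSpanResponse{})
		}
		if err != nil {
			return err
		}
		if err := s.writer.WriteSpan(stream.Context(), req.Span); err != nil {
			// Returning an error aborts the stream; spans still in flight are dropped.
			return err
		}
	}
}
```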

In my test, compared with the unary RPC, the client-streaming RPC reduced CPU utilization by 35% under the same throughput pressure, and increased the maximum throughput by 76% under the same hardware and software configuration. The detailed data is below.

test result

environment

CPU=10 * (Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz)
MEM=40GB
COLLECTOR_QUEUE_SIZE=4096
COLLECTOR_NUM_WORKERS=102
"the same write pressure": call greeter grpc service with ghz, concurrency=100 and num=5000000, 11 spans per request.

data summary

|                      | CPU usage under the same write pressure | Maximum throughput under the same hardware and software conditions (spans/sec) |
|----------------------|-----------------------------------------|---------------------------------------------------------------------------------|
| WriteSpan            | 186%                                    | 25974                                                                            |
| WriteSpanStreamingly | 120%                                    | 45775                                                                            |
| Diff                 | ↓35%                                    | ↑76%                                                                             |

pprof flame graphs

[unary pprof flame graph image]

[stream pprof flame graph image]

Any open questions to address

If there are no objections, I am willing to open a PR to implement this.

yurishkuro (Member) commented:

A couple comments:

(1) Adding a new method to the existing interface is a breaking change, so our approach so far has been to add separate interfaces and expose support for them via the Capabilities API.

Following this methodology, you could add a new interface:

service StreamingSpanWriterPlugin {
    rpc WriteSpanStream(stream WriteSpanRequest) returns (WriteSpanResponse);
}
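
For illustration only, a hedged sketch of how the caller could discover that capability, assuming the existing PluginCapabilities service and a hypothetical streamingSpanWriter flag added to CapabilitiesResponse; the function and field names below are not part of the current API.

```go
package plugin

import (
	"context"

	"github.com/jaegertracing/jaeger/proto-gen/storage_v1"
	"google.golang.org/grpc"
)

// useStreamingWriter reports whether the plugin advertises the streaming span writer;
// callers would fall back to the unary WriteSpan when it returns false.
func useStreamingWriter(ctx context.Context, conn *grpc.ClientConn) bool {
	caps, err := storage_v1.NewPluginCapabilitiesClient(conn).Capabilities(ctx, &storage_v1.CapabilitiesRequest{})
	if err != nil {
		return false
	}
	return caps.StreamingSpanWriter // hypothetical new flag alongside archiveSpanReader/archiveSpanWriter
}
```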

(2) It is not clear to me how the error path would be communicated back to Jaeger with a streaming approach. You are showing WriteSpanResponse, which is meant to represent the result of writing a single span, not of a collection of spans sent via a stream. If you make the response (stream WriteSpanResponse), then it's not clear to me how you would decide which response corresponds to which span.

The error from the storage is a peculiar thing: we tend not to do much with it in the Jaeger collector because the processing is async (with a queue in between), so we cannot propagate the error back to the exporter. Even in the Kafka ingester, I think we drop the span if the write returns an error, since we don't want to hold up the rest of the Kafka stream.

So it is possible that your approach might work fine, but the error handling and dropping of spans would happen lower in the stack. That might be OK, but it would make the drop metrics misleading, since they are produced at a higher level in the stack.

yurishkuro (Member) commented:

I am also curious how any sort of backpressure could be done via gRPC stream. What would prevent the collector from just writing spans to the stream much faster than the storage plugin can process them?

vuuihc (Contributor, Author) commented Apr 24, 2022

@yurishkuro Hi Yuri,

> I am also curious how any sort of backpressure could be done via gRPC stream. What would prevent the collector from just writing spans to the stream much faster than the storage plugin can process them?

gRPC streams' flow control is built on HTTP/2 flow control, so there will be backpressure if the server's WriteSpan blocks because the storage is slow.
References:
https://github.com/grpc/grpc-go/blob/master/stream.go#L113
https://httpwg.org/specs/rfc7540.html#FlowControl
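
For illustration, a minimal sketch of the caller-side write loop, assuming the StreamingSpanWriterPlugin interface suggested above is generated (the spans channel is illustrative). Send blocks once the HTTP/2 flow-control window is exhausted, which is where the backpressure on the collector comes from.

```go
package plugin

import (
	"github.com/jaegertracing/jaeger/model"
	"github.com/jaegertracing/jaeger/proto-gen/storage_v1"
)

// writeLoop pushes spans onto the client stream until the channel is closed, then
// closes the stream and waits for the server's single aggregate response.
func writeLoop(stream storage_v1.StreamingSpanWriterPlugin_WriteSpanStreamClient, spans <-chan *model.Span) error {
	for span := range spans {
		// Send blocks when the receiver's flow-control window is full (backpressure).
		if err := stream.Send(&storage_v1.WriteSpanRequest{Span: span}); err != nil {
			return err // stream aborted by the server or the connection failed
		}
	}
	_, err := stream.CloseAndRecv()
	return err
}
```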

> So it is possible that your approach might work fine, but the error handling and dropping of spans would happen lower in the stack. That might be OK, but it would make the drop metrics misleading, since they are produced at a higher level in the stack.

My idea here is that, when using the gRPC plugin, the saved_by_svr metric may only represent the result of calling the gRPC WriteSpan, not the result of writing to storage, unless the server's WriteSpan implementation is required to be synchronous (and obviously we do not impose that limit). So in this scenario, we should let the plugin provide more detailed write metrics.
