
Poor performance on JSON serialization when using HTTP query #7250

Closed
jffifa opened this issue Sep 1, 2016 · 13 comments

@jffifa

jffifa commented Sep 1, 2016

I'm running an influxdb v0.13 instance on Ubuntu 14.04 LTS. It seems that influxdb shows poor performance for HTTP queries when the result JSON is large.

For example, when I simply run a count query on a field named count, the result JSON is quite small.

q=SELECT COUNT(count) FROM argus."7d".caller_10s WHERE time>='2016-08-30T23:01:30Z' AND time<'2016-08-31T11:01:20Z'

time curl 'localhost:8086/query?db=argus' --data-binary @d

{"results":[{"series":[{"name":"caller_10s","columns":["time","count"],"values":[["2016-08-30T23:01:30Z",725675]]}]}]}

Time stats:

real    0m1.483s
user    0m0.006s
sys 0m0.005s

However, when I want to list the field (725675 rows in total, as shown above), the query becomes slow. The size of the result JSON is 19MB.

q=SELECT count FROM argus."7d".caller_10s WHERE time>='2016-08-30T23:01:30Z' AND time<'2016-08-31T11:01:20Z'

time curl 'localhost:8086/query?db=argus' --data-binary @d > res

Time stats

real    0m11.533s
user    0m0.005s
sys 0m0.018s

And it becomes slower if I select more fields (even if I just duplicate the same field count). The size of the result JSON is 22MB.

q=SELECT count, count, count FROM argus."7d".caller_10s WHERE time>='2016-08-30T23:01:30Z' AND time<'2016-08-31T11:01:20Z'  # just select duplicate field

time curl 'localhost:8086/query?db=argus' --data-binary @d > res

Time stats

real    0m17.639s
user    0m0.004s
sys 0m0.066s

I guess the poor performance is due to the JSON serialization process for large datasets in influxdb. The HTTP query goes through the lo interface, so network speed is not the cause. Neither is the hard drive's I/O, as I have tried redirecting stdout to /dev/null and the time cost remains the same.

Would you mind doing some profiling on the JSON serialization?

@jsternberg
Contributor

Possibly related to #7154.

@jsternberg
Contributor

So I spent a bunch of time on this and came away with what amounted to nothing. I ran performance comparisons between the tinylib/msgp library and the JSON encoder we've got right now and, weirdly, found no difference between them. In the benchmarks, msgpack was definitely faster, but when I profiled it with a query against the actual server, the response time was mostly the same.

I also tried discarding the output on the wire completely so I didn't have to compare different serialization methods. The results were mostly the same. I'll try to find some time this week to reproduce my results (since I'm relaying them from memory right now), but I think the current blocker is within the query engine. While making serialization faster is a noble goal, I don't think it will produce any sizeable gains, because a lot of the hotspots I found in the heat map were the garbage collector and channel operations. The garbage collector shows up because we don't handle memory as efficiently as we probably should, and the channels because selecting raw fields produces channels at the moment.

More testing is needed to find what hotspots are there and how to optimize them, but I don't think the JSON encoder is the current hotspot.

@jwilder jwilder added this to the 1.1.0 milestone Sep 16, 2016
@jffifa
Author

jffifa commented Sep 23, 2016

After reading your reply, I think there may be another cause if serialization is not the hotspot: creating a large number of query result objects one by one, something like

results = []
for r in rawResults:
    results.append(Result(r))

I'm not familiar with golang or its compiler. A naively implemented compiler and garbage collector may turn such an operation into a huge number of memory allocation and freeing system calls, bouncing the process between userspace and kernel space, which can cost a lot of CPU.

As far as I know, such code performs poorly in Python. I'm not sure whether golang, as a compiled language, behaves the same way. Maybe you could use techniques like a memory pool to improve performance.
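
For illustration only, here is a minimal sketch of what a pool-based approach could look like in Go using the standard library's sync.Pool; the Result type and processRows helper are hypothetical, not influxdb's actual code:

package main

import (
	"fmt"
	"sync"
)

// Result is a stand-in for the per-row result object discussed above
// (hypothetical; not influxdb's actual type).
type Result struct {
	Values []interface{}
}

// resultPool reuses Result allocations instead of allocating a fresh
// object per row and leaving everything to the garbage collector.
var resultPool = sync.Pool{
	New: func() interface{} { return &Result{} },
}

func processRows(rawRows [][]interface{}) {
	for _, raw := range rawRows {
		r := resultPool.Get().(*Result)
		r.Values = append(r.Values[:0], raw...) // reuse the backing array when possible
		// ... encode r into the HTTP response here ...
		resultPool.Put(r)
	}
}

func main() {
	processRows([][]interface{}{
		{"2016-08-30T23:01:30Z", 1},
		{"2016-08-30T23:01:40Z", 2},
	})
	fmt.Println("done")
}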

The word "poor", I mean, is compared with popular RDBMS like MySQL. I've created a similar table with a "time" column B-Tree indexed, it seemed that if the result set was large, MySQL did much better.

As you said, more performance testing and profiling is needed to check whether optimization can deliver a significant performance improvement, or whether we have simply reached the limits of golang.

Anyway, thanks for considering the performance improvement. My use case is listing API statistics in our systems over a period of time, so results with thousands of rows are quite common. If you need any detailed information to help, I'm glad to share it.

@jwilder jwilder modified the milestones: 1.2.0, 1.1.0 Oct 6, 2016
@jsternberg jsternberg removed their assignment Nov 29, 2016
@jwilder jwilder removed this from the 1.2.0 milestone Jan 3, 2017
@rbetts
Contributor

rbetts commented Apr 26, 2017

@jwilder commented out-of-band: "we’ve thought about switching to ffjson or some non-reflection based json marshaller. related: #4363"

@stuartcarnie
Contributor

I used easyjson for serialization / deserialization of billions of JSON messages over web sockets. Their benchmarks are pretty comprehensive too.

@stuartcarnie
Contributor

In addition, is it worth adding support for RFC 7464: JavaScript Object Notation (JSON) Text Sequences?

If the client includes an Accept: application/json-seq header, the server would write data in a similar way to the csv writer (header row followed by values).

{ "name": "cluster", "columns": ["time", "clusterID", "copyShardReq", "createIteratorReq", "expandSourcesReq", "fieldDimensionsReq", "hostname", "nodeID", "removeShardReq", "writeShardFail", "writeShardPointsReq"] }
["2017-04-24T23:59:10Z","535175895417456351",0,0,0,0,"stuart-influx.local","data-0:8088",0,0,0,0]
["2017-04-24T23:59:20Z","535175895417456351",0,0,0,0,"stuart-influx.local","data-0:8088",0,0,0,0]

@jsternberg
Contributor

I don't actually think there is a marshaling problem. I tried to make marshaling faster back when I looked at this a while ago, but it didn't speed up real performance because the limiting factors weren't in the marshaling.

Further, we should consider moving away from JSON and just supporting JSON as a debugging mechanism. JSON causes a bunch of other unrelated issues since it can't differentiate between floats and ints and it cannot accurately represent integers above 2^53. Since we support 2^63 with signed 64-bit integers, this is a bit inconvenient.
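
For illustration, a small Go snippet showing the 2^53 rounding with the standard encoding/json decoder (which decodes numbers into float64 by default):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// 2^53 + 1 = 9007199254740993 has no exact float64 representation,
	// so the default decoder silently rounds it.
	var out map[string]interface{}
	if err := json.Unmarshal([]byte(`{"value": 9007199254740993}`), &out); err != nil {
		panic(err)
	}
	fmt.Printf("%.0f\n", out["value"]) // prints 9007199254740992
}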

@stuartcarnie
Contributor

Agreed – we'll start with code-generated serialization

@stuartcarnie
Contributor

I have added results of some detailed analysis here. I used easyjson to generate serialization methods for the core types involved in /query responses. I chose easyjson, as I have used it in a previous production project.

@seebs
Contributor

seebs commented Oct 20, 2018

A while ago I was annoyed by some things to do with JSON encoding and decoding when talking to influxdata, and while thinking about that I thought "this has to be expensive, right?", which is how I ended up here. Along the way I stumbled across the csv/msgpack support on the server side, which has no corresponding functionality on the client side, which is why I had never noticed it existed.

I was a little surprised at the relatively small magnitude of the improvements from improving the marshalling code, but after looking at it more closely, I think this is probably a data structure issue.

10k rows with three values in each row is 10k slices of three interface{}, each of which in turn has to have a pointer to an underlying object. This is a fairly large volume of pointers for the GC to track, and it probably means they're all different allocations; I'm pretty sure that they will be after unmarshalling, in any event.

Performance might be improved by coalescing the allocations for the backing store, but my intuition is that if you really want to reduce the memory/CPU overhead much, you'd need to switch to slices of underlying types. So, instead of each row having a slice of interface{}, each series would have a slice of interface{} -- each member of which would be a slice of a concrete type, holding the concrete values for one column. So, if you have a column of timestamps, that would be a single slice of 10,000 time.Time, instead of 10,000 individual interface{} each wrapping a time.Time.
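
A rough sketch of the shape described above (hypothetical types, not influxdb's actual ones):

package main

import (
	"fmt"
	"time"
)

// Row-oriented shape (roughly what exists today): one []interface{} per row,
// so every cell is a separate boxed value the GC has to track.
type rowSeries struct {
	Values [][]interface{}
}

// Column-oriented alternative: the series holds one entry per column, and
// each entry is a slice of a concrete type, so a column of 10,000 timestamps
// is a single []time.Time allocation.
type columnSeries struct {
	Columns []interface{} // each element is e.g. []time.Time or []int64
}

func main() {
	col := columnSeries{
		Columns: []interface{}{
			[]time.Time{time.Unix(0, 0), time.Unix(10, 0)},
			[]int64{42, 57},
		},
	}
	fmt.Println(len(col.Columns))
}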

@seebs
Contributor

seebs commented Oct 26, 2018

FWIW, on my system, dumping the same exact largeish (~1.5M row) query (curl ... > /dev/null) takes about 10 seconds with JSON, and 5 seconds with msgpack. In both cases, almost the entire time passes before any data is sent, which may also be inefficient. (But also, JSON is being compressed, and msgpack isn't, which costs a lot of CPU to save a lot of bandwidth.)

(Also, I was about to say "oh, and never mind, ints and pointers can be inlined in interfaces", but they actually can't since around Go 1.5, apparently, because that caused problems for the GC.)

@dgnorton dgnorton added the 1.x label Jan 7, 2019
@stale

stale bot commented Jul 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 23, 2019
@stale

stale bot commented Jul 31, 2019

This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.

@stale stale bot closed this as completed Jul 31, 2019