
Poor performance on JSON serialization when using HTTP query #7250

Closed
jffifa opened this issue Sep 1, 2016 · 13 comments

@jffifa

jffifa commented Sep 1, 2016

I'm running an influxdb v0.13 instance on Ubuntu 14.04 LTS. It seems that influxdb shows poor performance for HTTP queries when the result JSON is large.

For example, when I simply run a count query on a field named count, the result JSON is quite small.

q=SELECT COUNT(count) FROM argus."7d".caller_10s WHERE time>='2016-08-30T23:01:30Z' AND time<'2016-08-31T11:01:20Z'

time curl 'localhost:8086/query?db=argus' --data-binary @d

{"results":[{"series":[{"name":"caller_10s","columns":["time","count"],"values":[["2016-08-30T23:01:30Z",725675]]}]}]}

Time stats:

real    0m1.483s
user    0m0.006s
sys 0m0.005s

However, when I want to list the field (725675 rows in total, as shown above), the query becomes slow. The size of the result JSON is 19MB.

q=SELECT count FROM argus."7d".caller_10s WHERE time>='2016-08-30T23:01:30Z' AND time<'2016-08-31T11:01:20Z'

time curl 'localhost:8086/query?db=argus' --data-binary @d > res

Time stats

real    0m11.533s
user    0m0.005s
sys 0m0.018s

And it becomes slower if I select more fields (even if I just duplicate the same field count). The size of the result JSON is 22MB.

q=SELECT count, count, count FROM argus."7d".caller_10s WHERE time>='2016-08-30T23:01:30Z' AND time<'2016-08-31T11:01:20Z'  # just select duplicate field

time curl 'localhost:8086/query?db=argus' --data-binary @d > res

Time stats

real    0m17.639s
user    0m0.004s
sys 0m0.066s

I guess the poor performance is due to the JSON serialization process for large datasets in influxdb. The HTTP query goes through the lo interface, so network speed is not the cause. Neither is the hard drive's I/O, as I have tried redirecting stdout to /dev/null and the time cost remains the same.

Would you mind doing some profiling on the JSON serialization?

@jsternberg
Contributor

Possibly related to #7154.

@jsternberg
Contributor

So I spent a bunch of time on this and came away with what amounted to nothing. I ran performance comparisons between the tinylib/msgp library and the JSON encoder we've got right now and, weirdly, found no difference between them. In the benchmarks, msgpack was definitely faster, but when I profiled it with a query against the actual server, the response time was mostly the same.

I also tried discarding the output on the wire completely so I didn't have to compare different serialization methods. The results were mostly the same. I'll try to find some time this week to reproduce my results (since I'm relaying them from memory right now), but I think the current blocker is within the query engine. While making serialization faster is a noble goal, I don't think it will produce any sizeable gains, because a lot of the hotspots I found in the heat map were the garbage collector and channel operations. The garbage collector shows up because we don't handle memory as efficiently as we probably should, and the channels because selecting raw fields produces channels at the moment.

More testing is needed to find what hotspots are there and how to optimize them, but I don't think the JSON encoder is the current hotspot.

@jwilder jwilder added this to the 1.1.0 milestone Sep 16, 2016
@jffifa
Author

jffifa commented Sep 23, 2016

After reading your reply, I think there may be another cause if serialization is not the hotspot: creating a large number of query result objects one by one, something like

results = []
for r in rawResults:
    results.append(Result(r))

I'm not familiar with golang or its compiler. A naively implemented compiler and garbage collector may turn such an operation into a huge number of memory allocation and freeing system calls, bouncing the process between userspace and kernel space, which can cost a lot of CPU.

As far as I know, such code performs poorly in Python. I'm not sure whether golang, as a compiled language, behaves the same way. Maybe you could use techniques like a memory pool to improve performance.
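
For illustration only, here is a minimal sketch of what a pool-based approach could look like in Go using the standard library's sync.Pool; the Result type and processRows helper are hypothetical, not influxdb's actual code:

package main

import (
	"fmt"
	"sync"
)

// Result is a stand-in for the per-row result object discussed above
// (hypothetical; not influxdb's actual type).
type Result struct {
	Values []interface{}
}

// resultPool reuses Result allocations instead of allocating a fresh
// object per row and leaving everything to the garbage collector.
var resultPool = sync.Pool{
	New: func() interface{} { return &Result{} },
}

func processRows(rawRows [][]interface{}) {
	for _, raw := range rawRows {
		r := resultPool.Get().(*Result)
		r.Values = append(r.Values[:0], raw...) // reuse the backing array when possible
		// ... encode r into the HTTP response here ...
		resultPool.Put(r)
	}
}

func main() {
	processRows([][]interface{}{
		{"2016-08-30T23:01:30Z", 1},
		{"2016-08-30T23:01:40Z", 2},
	})
	fmt.Println("done")
}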

The word "poor", I mean, is compared with popular RDBMS like MySQL. I've created a similar table with a "time" column B-Tree indexed, it seemed that if the result set was large, MySQL did much better.

As you said, more performance testing and profiling is needed to check whether optimization can deliver a significant performance improvement, or whether we have simply reached the limits of golang.

Anyway, thanks for considering the performance improvement. My use case is listing API statistics in our systems over a period of time, so results with thousands of rows are quite common. If you need any detailed information to help, I'm glad to share it.

@jwilder jwilder modified the milestones: 1.2.0, 1.1.0 Oct 6, 2016
@jsternberg jsternberg removed their assignment Nov 29, 2016
@jwilder jwilder removed this from the 1.2.0 milestone Jan 3, 2017
@rbetts
Contributor

rbetts commented Apr 26, 2017

@jwilder commented out-of-band: "we’ve thought about switching to ffjson or some non-reflection based json marshaller. related: #4363"

@stuartcarnie
Contributor

I used easyjson for serialization / deserialization of billions of JSON messages over web sockets. Their benchmarks are pretty comprehensive too.

@stuartcarnie
Contributor

In addition, is it worth adding support for RFC 7464: JavaScript Object Notation (JSON) Text Sequences?

If the client includes an Accept: application/json-seq header, the server would write data in a similar way to the csv writer (header row followed by values).

{ "name": "cluster", "columns": ["time", "clusterID", "copyShardReq", "createIteratorReq", "expandSourcesReq", "fieldDimensionsReq", "hostname", "nodeID", "removeShardReq", "writeShardFail", "writeShardPointsReq"] }
["2017-04-24T23:59:10Z","535175895417456351",0,0,0,0,"stuart-influx.local","data-0:8088",0,0,0,0]
["2017-04-24T23:59:20Z","535175895417456351",0,0,0,0,"stuart-influx.local","data-0:8088",0,0,0,0]

@jsternberg
Contributor

I don't actually think there is a marshaling problem. I tried to make marshaling faster back when I looked at this a while ago, but it didn't speed up real performance because the limiting factors weren't in the marshaling.

Further, we should consider moving away from JSON and just supporting JSON as a debugging mechanism. JSON causes a bunch of other unrelated issues since it can't differentiate between floats and ints and it cannot accurately represent integers above 2^53. Since we support 2^63 with signed 64-bit integers, this is a bit inconvenient.
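
For illustration, a small Go snippet showing the 2^53 rounding with the standard encoding/json decoder (which decodes numbers into float64 by default):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// 2^53 + 1 = 9007199254740993 has no exact float64 representation,
	// so the default decoder silently rounds it.
	var out map[string]interface{}
	if err := json.Unmarshal([]byte(`{"value": 9007199254740993}`), &out); err != nil {
		panic(err)
	}
	fmt.Printf("%.0f\n", out["value"]) // prints 9007199254740992
}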

@stuartcarnie
Contributor

Agreed – we'll start with code-generated serialization

@stuartcarnie
Contributor

I have added results of some detailed analysis here. I used easyjson to generate serialization methods for the core types involved in /query responses. I chose easyjson, as I have used it in a previous production project.

@seebs
Contributor

seebs commented Oct 20, 2018

A while ago I was annoyed by some things to do with JSON encoding and decoding when talking to influxdata, and while thinking about that I thought "this has to be expensive, right?", which is how I ended up here. Along the way I stumbled across the csv/msgpack support on the server side, which has no corresponding functionality on the client side, which is why I had never noticed it existed.

I was a little surprised at the relatively small magnitude of the improvements from improving the marshalling code, but after looking at it more closely, I think this is probably a data structure issue.

10k rows with three values in each row is 10k slices of three interface{}, each of which in turn has to have a pointer to an underlying object. This is a fairly large volume of pointers for the GC to track, and it probably means they're all different allocations; I'm pretty sure that they will be after unmarshalling, in any event.

Performance might be improved by coalescing the allocations for the backing store, but my intuition is that if you really want to reduce the memory/CPU overhead much, you'd need to switch to slices of underlying types. So, instead of each row having a slice of interface{}, each series would have a slice of interface{} -- each member of which would be a slice of a concrete type, holding the concrete values for one column. So, if you have a column of timestamps, that would be a single slice of 10,000 time.Time, instead of 10,000 individual interface{} each wrapping a time.Time.
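
A rough sketch of the shape described above (hypothetical types, not influxdb's actual ones):

package main

import (
	"fmt"
	"time"
)

// Row-oriented shape (roughly what exists today): one []interface{} per row,
// so every cell is a separate boxed value the GC has to track.
type rowSeries struct {
	Values [][]interface{}
}

// Column-oriented alternative: the series holds one entry per column, and
// each entry is a slice of a concrete type, so a column of 10,000 timestamps
// is a single []time.Time allocation.
type columnSeries struct {
	Columns []interface{} // each element is e.g. []time.Time or []int64
}

func main() {
	col := columnSeries{
		Columns: []interface{}{
			[]time.Time{time.Unix(0, 0), time.Unix(10, 0)},
			[]int64{42, 57},
		},
	}
	fmt.Println(len(col.Columns))
}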

@seebs
Contributor

seebs commented Oct 26, 2018

FWIW, on my system, dumping the same exact largeish (~1.5M row) query (curl ... > /dev/null) takes about 10 seconds with JSON, and 5 seconds with msgpack. In both cases, almost the entire time passes before any data is sent, which may also be inefficient. (But also, JSON is being compressed, and msgpack isn't, which costs a lot of CPU to save a lot of bandwidth.)

(Also, I was about to say "oh, and never mind, ints and pointers can be inlined in interfaces", but they actually can't since around Go 1.5, apparently, because that caused problems for the GC.)

@dgnorton dgnorton added the 1.x label Jan 7, 2019
@stale

stale bot commented Jul 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 23, 2019
@stale

stale bot commented Jul 31, 2019

This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.

@stale stale bot closed this as completed Jul 31, 2019