Show tag keys performance on 1.4.1 (tsi1) #9125

Closed
reistiago opened this issue Nov 16, 2017 · 14 comments · Fixed by #9143

Comments

@reistiago commented Nov 16, 2017

Bug report

System info:
InfluxDB 1.4.1 with tsi1
Amazon Linux AMI release 2015.09 (CentOS-based)
Machine type: r4.4xlarge

Description:

We have a high-cardinality use case, as can be seen from:

SELECT numSeries FROM "_internal".."database" WHERE time > now() - 10s GROUP BY "database" ORDER BY desc LIMIT 1

name: database
tags: database=something
time                numSeries
----                ---------
1510849360000000000 31199677

This is mostly due to a specific measurement.

On 1.3.7 when running "show tag keys from measurement" it executes in a few seconds.

We tried to upgrade to 1.4.1; to do so, we did the following:

Expected behavior:

Have similar performance between 1.3.7 and 1.4.1

Actual behavior:

"show tag keys from measurement" takes around 10 minutes to execute.

We tried other lower-cardinality measurements on the same database, and their performance looks similar to 1.3.7.

Additional info:

We are trying to build a way to reproduce the issue that we can share, possibly a script that generates similar cardinality, but we haven't had time to finish it yet.
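
In the meantime, this is roughly the shape of the generator we have in mind (a minimal sketch only; the host, database name and series counts below are assumptions, not our real setup):

#!/usr/bin/env bash
# Sketch of a high-cardinality generator: each point gets a unique "id" tag,
# so every write creates a new series in a single measurement.
HOST="http://localhost:8086"   # assumed local InfluxDB
DB="cardinality_repro"         # hypothetical database name
TOTAL=1000000                  # scale up towards ~30M to approximate our case
BATCH=5000

curl -s -XPOST "$HOST/query" --data-urlencode "q=CREATE DATABASE $DB" > /dev/null

buf=""
for ((i = 1; i <= TOTAL; i++)); do
  # one unique tag combination per iteration -> one new series
  buf+="m,host=h$((i % 100)),id=id$i value=1i $i"$'\n'
  if (( i % BATCH == 0 )); then
    curl -s -XPOST "$HOST/write?db=$DB&precision=s" --data-binary "$buf" > /dev/null
    buf=""
  fi
done
# flush any remainder
[ -n "$buf" ] && curl -s -XPOST "$HOST/write?db=$DB&precision=s" --data-binary "$buf" > /dev/null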

@e-dard (Contributor) commented Nov 16, 2017

Hi @reistiago,

How many retention policies do you have? Could you provide SHOW RETENTION POLICIES ON "something"?

One of the bugs we fixed with SHOW TAG KEYS was that it previously only queried the default retention policy, whereas in 1.4.0 it queries all retention policies in the database.

If you have other retention policies on your database, that could be part of the reason why it's now taking longer than it did on 1.3.7, especially if those retention policies contain the measurement.

P.S. we now have an explicit cardinality command, which is a bit easier to remember than the query for the _internal stats: SHOW SERIES CARDINALITY ON "something"

@benbjohnson (Contributor)

@reistiago Thanks for the bug report. Can you run a profile while the query is running? That will help us narrow down what the issue is.

  1. Run the query.
  2. Go to http://YOURHOST:8086/debug/pprof/all?cpu=1. This request can take 30s to execute.

This will return a profiles.tar.gz file. Can you post that to this issue?
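
The same profile bundle can also be fetched non-interactively, e.g. with curl (a small sketch; YOURHOST is the placeholder from the URL above):

$ curl -o profiles.tar.gz "http://YOURHOST:8086/debug/pprof/all?cpu=1"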

@reistiago (Author) commented Nov 16, 2017

@e-dard

Only the default (autogen):


> SHOW RETENTION POLICIES on something;
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true

I know about SHOW SERIES CARDINALITY ON "something", but this is only available on 1.4.x, right? We had to revert back to 1.3.7, otherwise the system was unusable (it relies on 'show tag keys').

@benbjohnson As I said above, we had to revert back to 1.3.7, so I can't really do that now, and it takes quite a long time (hours) to recreate the index from our current dataset. I'll try to get it tomorrow.

@e-dard (Contributor) commented Nov 16, 2017

@reistiago OK, we understand. We can try to reproduce on our side. Roughly how many measurements do you have in "something"? Tens, hundreds, thousands?

@reistiago (Author) commented Nov 16, 2017

@e-dard very few, around 10.

@reistiago (Author)

@benbjohnson @e-dard

Attached a few pprof files captured while the query was executing; the file names should be self-explanatory.

I don't know how long the query took to run because I got busy with something else in the meantime and didn't notice when it finished, but it took longer than 23 minutes.

profiles-query-running-for-4-minutes.tag.gz
profiles-query-running-for-7-minutes.tar.gz
profiles-query-running-for-9-minutes.tar.gz
profiles-query-running-for-13-minutes.tar.gz
profiles-query-running-for-18-minutes.tar.gz
profiles-query-running-for-22-minutes.tar.gz
profiles-query-running-for-less-then-one-minute.tar.gz

In case this is useful: looking at the output of top, it looks like just one core is being used, and it's spending most of its time in I/O wait.


Tasks: 177 total,   1 running, 176 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  7.0%us,  0.3%sy,  0.0%ni,  0.0%id, 92.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  125833124k total, 118226788k used,  7606336k free,    80196k buffers
Swap:        0k total,        0k used,        0k free, 95164624k cached
...

The InfluxDB log while the query was running was mostly:

...
[I] 2017-11-17T12:58:00Z compact id=2287 (0/0) (0/0) (0/0) (0/0) engine=tsm1
[I] 2017-11-17T12:58:01Z compact id=2083 (0/0) (0/0) (0/0) (0/0) engine=tsm1
[I] 2017-11-17T12:58:01Z compact id=2288 (0/0) (0/0) (0/0) (0/0) engine=tsm1
[I] 2017-11-17T12:58:01Z compact id=2287 (0/0) (0/0) (0/0) (0/0) engine=tsm1
[I] 2017-11-17T12:58:02Z compact id=2083 (0/0) (0/0) (0/0) (0/0) engine=tsm1
[I] 2017-11-17T12:58:02Z compact id=2288 (0/0) (0/0) (0/0) (0/0) engine=tsm1
[I] 2017-11-17T12:58:02Z compact id=2287 (0/0) (0/0) (0/0) (0/0) engine=tsm1
[I] 2017-11-17T12:58:03Z compact id=2083 (0/0) (0/0) (0/0) (0/0) engine=tsm1
[I] 2017-11-17T12:58:03Z compact id=2288 (0/0) (0/0) (0/0) (0/0) engine=tsm1
[I] 2017-11-17T12:58:03Z compact id=2287 (0/0) (0/0) (0/0) (0/0) engine=tsm1
[I] 2017-11-17T12:58:04Z compact id=2083 (0/0) (0/0) (0/0) (0/0) engine=tsm1
...

Using the new method to calculate the cardinality returns:

> SHOW SERIES CARDINALITY ON something;
cardinality estimation
----------------------
31211064

Space occupied on disk by that database's data folder: 64G.

I've kept an image of this machine, so running further tests should be quicker in case you need more information.

e-dard assigned e-dard and unassigned benbjohnson Nov 20, 2017

@e-dard (Contributor) commented Nov 20, 2017

@reistiago do you have any idea how many tag keys you have in your database? Essentially:

$ influx -database "something" -execute 'SHOW TAG KEYS FROM measurement' | wc -l

You could run it on 1.3.7; the answer would be the same. I'm trying to reproduce this, but after writing 30M series I still get almost instantaneous answers on 1.4.1. My working theory now is that you have many tag keys, rather than just many tag values.

@e-dard (Contributor) commented Nov 20, 2017

@reistiago Ah, I think I have reproduced it... It's the number of shards you have that's the problem. It looks like you have around 1,000 shards at the moment. I'll look into optimising the query when there are many shards.

However (and this advice isn't merely to improve the performance of this query), you should really consider an alternative retention policy unless you're dropping data a lot.

Fewer shards will put less stress on the system and will significantly reduce the size of your data directory: there will be fewer (but larger) tsi1 indexes, so there is less index overhead overall. Compression will also likely improve with fewer shards, and there may well be a write performance improvement.

What's your current retention policy duration?
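
As a quick check, the per-database shard count can be read from SHOW SHARDS; a rough sketch using the influx CLI, where "something" is the database from this thread and the grep count is only approximate (it also matches the group header line):

$ influx -execute 'SHOW SHARDS' | grep -c "something"   # approximate shard count for "something"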

@reistiago (Author) commented Nov 20, 2017

@e-dard assuming the shard count equals the number of folders in /data/{database}/{retention-policy}, your guess about the number of shards is correct: 1043.

Only 9 tag keys on that measurement, so you are likely correct that the problem is not the number of tag keys.

We are using the default retention policy, so infinite.

Increasing the shard size / duration will probably cause slower reads, correct?

@reistiago (Author) commented Nov 21, 2017

@e-dard sorry for the question, since it's only tangentially related to the issue.

If I alter the retention policy (autogen in this case) with:

alter retention policy "autogen" ON "something" SHARD DURATION 1000w

Is this retroactively applied to the shards that already exist? Or must this be done at database creation?

@reistiago (Author)

@e-dard any news on this issue / PR?

ghost removed the review label Nov 27, 2017
@e-dard (Contributor) commented Nov 27, 2017

@reistiago a fix has been merged for this. I believe it will be back-ported into a 1.4.3 release, though I don't know exactly when the release will ship. I would expect within weeks though.

Regarding your other question: yes, ALTER RETENTION POLICY only applies to future shards. Currently, restructuring your old shards has to be done with other tooling, e.g., an export using influx_inspect followed by a reimport.
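
As a rough sketch of that export/reimport route (the paths below are assumed stock package defaults, and the database would need to be dropped and recreated with the desired SHARD DURATION before importing):

# paths and database name are assumptions for illustration
$ influx_inspect export -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal \
    -database something -retention autogen -out /tmp/something.lp.gz -compress
$ influx -import -path=/tmp/something.lp.gz -compressed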

@reistiago (Author)

@e-dard I still have the machine image with the problem, so I will give it a try when the fix is available.

Thanks for the reply regarding the shards configuration.

@reistiago (Author) commented Dec 11, 2017

@e-dard this ended up not being back-ported, correct?

Also, looking at the 1.5.0 changelog (in master), it's not referenced yet. I'm just confused about the status of this issue / PR.
