[inmem] Startup in 1.5.0 is much slower than in 1.3.5 and 1.4.2 #9534
@ptitou Can you provide some details about the hardware? e.g. How many cores, how much RAM, and are you running SSDs? Does this occur on every startup or just the first time?
Hardware Info
It occurs on every startup, even on another server (tested on a VM with SSD; we get the same duration).
@ptitou How many series do you have in the shard?
Hi, I'm working on this issue with @ptitou.
@max3163 Thank you. Can you start up with profiling enabled:
$ influxd run -cpuprofile cpu.pprof -memprofile mem.pprof
Then zip the cpu.pprof and mem.pprof files and attach them here.
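For anyone following along, one way to inspect those profiles locally is with go tool pprof. This is only a sketch, assuming a reasonably recent Go toolchain is installed and the file names match the command above:
# open each profile and show the heaviest functions at the interactive prompt
$ go tool pprof cpu.pprof
(pprof) top
$ go tool pprof mem.pprof
(pprof) top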
Also, 1.5.0 takes MUCH more RAM, even using the tsi1 index. I guess I'll have to switch back to 1.4.3 =\
Find attached the pprof files as asked. We also see twice the RAM usage on InfluxDB 1.5:
@max3163 @gabrielmocan Thanks for the info. We've consistently seen a drop in memory usage so I'm curious as to why your memory has gone up. I'm taking a look at the profiles now.
@gabrielmocan One thing to note is that enabling TSI will only affect new shards. You can build TSI indexes for existing shards using the influx_inspect buildtsi command.
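A minimal sketch of that rebuild follows. The service name, user, and paths are the defaults from the Linux packages and are only assumptions; adjust them to your setup, and stop influxd before running it:
# stop the server, rebuild TSI indexes as the influxdb user, then restart
$ sudo systemctl stop influxdb
$ sudo -u influxdb influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal
$ sudo systemctl start influxdb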
@max3163 It looks like it's still using the inmem index. Can you show the output of the tree command?
@benbjohnson I've migrated from 1.4.3 and the shards were already tsi1 indexed.
@gabrielmocan Unfortunately, you'll need to rebuild the TSI indexes when migrating to 1.5. Can you try that and let me know if it improves your memory usage?
As indicated in the title of the issue, the problem occurs when we use the "inmem" engine, yes. Sorry, I don't have access to the server for now; I will only be able to send you the output of the tree command on Monday.
@max3163 I apologize. I got sidetracked on the TSI side discussion. Disregard my previous comment.
Sorry for disrupting the topic, guys.
No problem! And don't hesitate to ask me for more information if needed.
@max3163 I'm still looking into the issue. From the cpu profile it looks like there's not much CPU usage. The profile duration was ...
@benbjohnson the disks are attached to the server with a Dell RAID5 controller. I confirm there is a lot of IO wait during startup in 1.5.0. I'm also attaching a zip with the cpu.pprof and mem.pprof files for the 1.4.2 startup, to compare with 1.5.0, find whether there are concurrent accesses generating the IO wait, and see where the 1.5.0 startup process differs.
If we look at the disk IO graph and add up the total data read, we notice that InfluxDB 1.5.0 reads about 480GB of data while the files on disk total about 240GB (half as much)! Maybe that's a lead for the higher memory usage in 1.5.0 compared to 1.4.2?
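A rough way to cross-check how much the running process has actually read from disk, sketched here under the assumption of a single influxd process on Linux:
# read_bytes in /proc/<pid>/io counts bytes the process caused to be fetched from storage
$ pid=$(pgrep -o influxd)
$ grep -E 'rchar|read_bytes' /proc/$pid/io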
I have large constant read I/O of 122MB/s and 955 IOPS on a 1000GB GP2 EBS volume.
@ptitou @szibis Do you have full startup logs that you can share? Also, how frequently are you adding new series? 1.5.0 uses a new series file that is persisted to disk; it lets us scale to a higher number of series, and we memory-map the data into the process.
Here are the startup log files for v1.4.2 and v1.5.0, and also the number of series for the last 7 days. The tests were made on a snapshot of the production database, so there is no new data between the startup tests.
@ptitou Can you send the output of this command? It could be slow if your series data isn't being held in memory. That doesn't seem like the case since you have 64GB of RAM, but I want to double check.
$ find $DATADIR -type d | grep _series | xargs ls -l
Also, are you running this in a container, or restarting the host between startups?
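A related sketch, in case it helps: totalling the on-disk size of the series files, which is roughly what gets memory-mapped at startup. Same $DATADIR as above; the _series directory layout is assumed from the grep in the previous command:
$ find $DATADIR -type d -name _series -exec du -sh {} +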
Here is the output of the find command. The database is running on a physical server; there is no restart of the host between startups, and no container or other VM.
@ptitou Sorry for the delay. We're having trouble reproducing the issue. Are you able to share any of your data files? Or can you run ...
I don't have the shard qualif/default/1773, but if I launch the command on another shard, first with v1.5.0 and then with v1.4.3, here is the result (launch time from the logs):
v1.5.0:
v1.4.3:
I don't know if it can help you. If you want a shard, do you have an FTP server where I can upload this one (6GB of data)?
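If it's useful, one possible way to package that shard for upload. The shard path is taken from the comment above; /var/lib/influxdb/data is only the assumed default data directory:
$ tar czf shard_1773.tar.gz -C /var/lib/influxdb/data qualif/default/1773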
@ptitou
@ptitou OK, this is weird. I've tried on several machines and I can't reproduce the issue. I even tried it on a limited 1 vCPU/1GB RAM machine:
v1.5.0:
v1.4.3:
I accidentally fired up an Ubuntu instance instead of CentOS, but I'm firing up a CentOS 7 instance right now to double check.
OK, it looks like it is an issue on CentOS. I'm not seeing quite the same disparity but it's still large. 1.4.3 is about ...
We've got the same result on Red Hat Enterprise Linux Server release 6.4 (Santiago) and on CentOS Linux release 7.2.1511 (Core), as explained in the first comment, if that helps.
@ptitou I tracked it down to a change in ... Thanks for all the help tracking this issue down.
👍 Thanks for your perseverance! I will try the patch and tell you if it's OK for us!
Hi, I have tested 1.5.1 on Debian and I also see long restart times. InfluxDB goes through all the shards and the disk is 100% busy reading. Opening each file takes from several seconds to more than 1 minute, as shown in the logs.
First attempt is a restart with ... Is there a reason why it tries to open all the shards on restart? The dataset is 2.8 TB, with 2 million series distributed across 10 databases of different sizes.
@ptitou
The query is:
SELECT last("numSeries") AS "numSeries" FROM "database" WHERE ("hostname" = '$host') AND $timeFilter GROUP BY time($__interval) fill(null)
From the _internal datasource.
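As a side note, series counts can also be checked directly from the influx CLI. A small sketch, assuming InfluxDB 1.4 or later; "database" below is just the placeholder database name carried over from the query above:
# prints the series cardinality estimate for the given database
$ influx -execute 'SHOW SERIES CARDINALITY ON "database"'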
Test done today with the nightly release; everything works fine! Do you plan to release a version 1.5.2 with this commit (as explained in issue #9614), or do we have to wait for release 1.6?
@ptitou Sorry for the delay. This fix is in 1.5.2.
I'm reopening issue #9486 because I have the same behaviour with the final 1.5.0 version.
Bug report
System info:
OS:
Red Hat Enterprise Linux Server release 6.4 (Santiago)
or
CentOS Linux release 7.2.1511 (Core)
Steps to reproduce:
Expected behavior:
Same startup time in all versions
Actual behavior:
We ran a test in 1.5.0 with tsi1 enabled (after the influx_inspect buildtsi command) and we got the same result.