[inmem] Startup in 1.5.0 is much slower than in 1.3.5 and 1.4.2 #9534
@ptitou Can you provide some details about the hardware? e.g. How many cores, how much RAM, and are you running SSDs? Does this occur on every startup or just the first time?
Hardware Info
It occurs on every startup, even on another server (tested on a VM with SSD; we get the same duration).
@ptitou How many series do you have in the shard?
Hi, I'm working on this issue with @ptitou.
@max3163 Thank you. Can you start up with profiling enabled:
$ influxd run -cpuprofile cpu.pprof -memprofile mem.pprof
Then zip the cpu.pprof and mem.pprof files and attach them here.
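For anyone following along, one way to inspect those profiles locally is with go tool pprof. This is only a sketch, assuming a reasonably recent Go toolchain is installed and the file names match the command above:
# open each profile and show the heaviest functions at the interactive prompt
$ go tool pprof cpu.pprof
(pprof) top
$ go tool pprof mem.pprof
(pprof) top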
Also, 1.5.0 takes MUCH more RAM, even using the tsi1 index. I guess I'll have to switch back to 1.4.3 =\
Find attached the pprof files as asked. We also see twice the RAM usage on InfluxDB 1.5:
@max3163 @gabrielmocan Thanks for the info. We've consistently seen a drop in memory usage so I'm curious as to why your memory has gone up. I'm taking a look at the profiles now.
@gabrielmocan One thing to note is that enabling TSI will only affect new shards. You can build TSI indexes for existing shards using the influx_inspect buildtsi command.
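A minimal sketch of that rebuild follows. The service name, user, and paths are the defaults from the Linux packages and are only assumptions; adjust them to your setup, and stop influxd before running it:
# stop the server, rebuild TSI indexes as the influxdb user, then restart
$ sudo systemctl stop influxdb
$ sudo -u influxdb influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal
$ sudo systemctl start influxdb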
@max3163 It looks like it's still using the inmem index. Can you show the output of the tree command?
@benbjohnson I've migrated from 1.4.3 and the shards were already tsi1 indexed.
@gabrielmocan Unfortunately, you'll need to rebuild the TSI indexes when migrating to 1.5. Can you try that and let me know if it improves your memory usage?
As indicated in the title of the issue, the problem occurs when we use the "inmem" engine, yes. Sorry, I don't have access to the server for now; I will only be able to send you the output of the tree command on Monday.
@max3163 I apologize. I got sidetracked on the TSI side discussion. Disregard my previous comment.
Sorry for disrupting the topic, guys.
No problem! And don't hesitate to ask me for more information if needed.
@max3163 I'm still looking into the issue. From the cpu profile it looks like there's not much CPU usage. The profile duration was ...
@benbjohnson the disks are attached to the server with a Dell RAID5 controller. I confirm there is a lot of IO wait during startup in 1.5.0. I'm also attaching a zip with the cpu.pprof and mem.pprof files for the 1.4.2 startup, to compare with 1.5.0, find whether there are concurrent accesses generating the IO wait, and see where the 1.5.0 startup process differs.
If we look at the disk IO graph and add up the total data read, we notice that InfluxDB 1.5.0 reads about 480GB of data while the files on disk total about 240GB (half as much)! Maybe that's a lead for the higher memory usage in 1.5.0 compared to 1.4.2?
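A rough way to cross-check how much the running process has actually read from disk, sketched here under the assumption of a single influxd process on Linux:
# read_bytes in /proc/<pid>/io counts bytes the process caused to be fetched from storage
$ pid=$(pgrep -o influxd)
$ grep -E 'rchar|read_bytes' /proc/$pid/io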
I have large constant read I/O of 122MB/s and 955 IOPS on a 1000GB GP2 EBS volume.
@ptitou @szibis Do you have full startup logs that you can share? Also, how frequently are you adding new series? 1.5.0 uses a new series file that is persisted to disk; it lets us scale to a higher number of series, and we memory-map the data into the process.
Here are the startup log files for v1.4.2 and v1.5.0, and also the number of series for the last 7 days. The tests were made on a snapshot of the production database, so there is no new data between the startup tests.
@ptitou Can you send the output of this command? It could be slow if your series data isn't being held in memory. That doesn't seem like the case since you have 64GB of RAM, but I want to double check.
$ find $DATADIR -type d | grep _series | xargs ls -l
Also, are you running this in a container, or restarting the host between startups?
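A related sketch, in case it helps: totalling the on-disk size of the series files, which is roughly what gets memory-mapped at startup. Same $DATADIR as above; the _series directory layout is assumed from the grep in the previous command:
$ find $DATADIR -type d -name _series -exec du -sh {} +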
Here is the output of the find command. The database is running on a physical server; there is no restart of the host between startups, and no container or other VM.
@ptitou Sorry for the delay. We're having trouble reproducing the issue. Are you able to share any of your data files? Or can you run ...
I don't have the shard qualif/default/1773, but if I launch the command on another shard, first with v1.5.0 and then with v1.4.3, here is the result (launch time from the logs):
v1.5.0:
v1.4.3:
I don't know if it can help you. If you want a shard, do you have an FTP server where I can upload this one (6GB of data)?
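If it's useful, one possible way to package that shard for upload. The shard path is taken from the comment above; /var/lib/influxdb/data is only the assumed default data directory:
$ tar czf shard_1773.tar.gz -C /var/lib/influxdb/data qualif/default/1773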
@ptitou
@ptitou OK, this is weird. I've tried on several machines and I can't reproduce the issue. I even tried it on a limited 1 vCPU/1GB RAM machine:
v1.5.0:
v1.4.3:
I accidentally fired up an Ubuntu instance instead of CentOS, but I'm firing up a CentOS 7 instance right now to double check.
OK, it looks like it is an issue on CentOS. I'm not seeing quite the same disparity but it's still large. 1.4.3 is about ...
We've got the same result on Red Hat Enterprise Linux Server release 6.4 (Santiago) and on CentOS Linux release 7.2.1511 (Core), as explained in the first comment, if that helps.
@ptitou I tracked it down to a change in ... Thanks for all the help tracking this issue down.
👍 Thanks for your perseverance! I will try the patch and tell you if it's OK for us!
Hi, I have tested 1.5.1 on Debian and I also see long restart times. InfluxDB goes through all the shards and the disk is 100% busy reading. Opening each file takes from several seconds to more than 1 minute, as shown in the logs.
First attempt is a restart with ... Is there a reason why it tries to open all the shards on restart? The dataset is 2.8 TB, with 2 million series distributed across 10 databases of different sizes.
@ptitou
The query is:
SELECT last("numSeries") AS "numSeries" FROM "database" WHERE ("hostname" = '$host') AND $timeFilter GROUP BY time($__interval) fill(null)
From the _internal datasource.
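As a side note, series counts can also be checked directly from the influx CLI. A small sketch, assuming InfluxDB 1.4 or later; "database" below is just the placeholder database name carried over from the query above:
# prints the series cardinality estimate for the given database
$ influx -execute 'SHOW SERIES CARDINALITY ON "database"'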
Test done today with the nightly release; everything works fine! Do you plan to release a version 1.5.2 with this commit (as explained in issue #9614), or do we have to wait for release 1.6?
@ptitou Sorry for the delay. This fix is in 1.5.2.
I'm reopening issue #9486 because I have the same behaviour with the final 1.5.0 version.
Bug report
System info:
OS:
Red Hat Enterprise Linux Server release 6.4 (Santiago)
or
CentOS Linux release 7.2.1511 (Core)
Steps to reproduce:
Expected behavior:
Same startup time in all versions
Actual behavior:
We ran a test in 1.5.0 with tsi1 enabled (after the influx_inspect buildtsi command) and we got the same result.