
[2.1] Heavy memory footprint regression from 1.8 inmem to 2.1 #22936

Closed
sahib opened this issue Nov 25, 2021 · 7 comments

@sahib
Contributor

sahib commented Nov 25, 2021

Hello again,

I'm in the process of upgrading my 1.8 database to 2.1.1. The background is that we, as a company,
hope to get rid of the issues we had when upgrading to the tsi1 index in 1.8.

The upgrade works fine on my test setup and on a machine with a small amount of
data. Once deployed to a staging environment, though, which has the same
amount of data as our prod environments, the problems start to show up. We're
talking about roughly 30G of data (measured by directory size), so rather
a medium workload for Influx.

The first problem is that the upgrade is abysmally slow. It takes about 2 hours until
it does anything - I had to increase INFLUXD_INIT_PING_ATTEMPTS to 1000000 so
that the automated Docker upgrade doesn't kill the process. Before that point it
writes only minimal amounts of data to disk, then it suddenly starts writing
large quantities of data within a few minutes. During the upgrade it wrote roughly
30G of data, which seems reasonable. The upgrade part is not the real issue
here, I was just a bit surprised by the performance. After all, it makes the
conversion harder, especially if you have more than one instance. The whole upgrade
took roughly 4 hours.
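
For reference, the upgrade was triggered through the official Docker image, roughly like the
sketch below. Credentials, org/bucket names and mounts are placeholders for illustration; the only
part relevant to the timeout is the INFLUXD_INIT_PING_ATTEMPTS override:

# Sketch of the automated 1.x -> 2.x upgrade invocation (placeholder credentials/org/bucket).
# INFLUXD_INIT_PING_ATTEMPTS is raised so the entrypoint doesn't give up while the
# upgrade is still converting data.
docker run -d \
  -e DOCKER_INFLUXDB_INIT_MODE=upgrade \
  -e DOCKER_INFLUXDB_INIT_USERNAME=admin \
  -e DOCKER_INFLUXDB_INIT_PASSWORD=changeme \
  -e DOCKER_INFLUXDB_INIT_ORG=my-org \
  -e DOCKER_INFLUXDB_INIT_BUCKET=my-bucket \
  -e INFLUXD_INIT_PING_ATTEMPTS=1000000 \
  -v /var/lib/influxdb:/var/lib/influxdb \
  -v /var/lib/influxdb2:/var/lib/influxdb2 \
  influxdb:2.1.1-alpine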

The second (and actual) problem is that after starting, influxd consumes
insane amounts of memory. 1.8 happily ran with 8GB of memory usage (even though
we had inmem as the index). 2.1 seemed to eat RAM endlessly once it started:
I ran it on a 32GB machine and still had to add 10GB of swap, otherwise the
startup would fail. From the logs one could see that it was re-indexing data,
during which influxd was also not accepting any queries. Swapping might have
slowed things down, but it still took roughly 10 hours until influxd
started accepting queries. Also, compared to 1.8's inmem index, the memory
usage is at least 5 times as high in our case. And that's on a machine that
does not see a lot of traffic. One issue we wanted to solve is the slow startup
time of Influx 1.8 (due to rebuilding the inmem index). This also seems to have
gotten worse, since every start of influxd still seems to rebuild the index.

Similar tickets (for 1.8 though, the first one opened by myself):

Any ideas on how to debug this? These performance characteristics do not really make it
possible to upgrade to 2.1 any time soon. Is there something obvious (like some config options)
that I'm missing?


Steps to reproduce:

  • Upgrade from 1.8 to 2.1 with 30G of data as described in the docs
    (and the README of the docker image)

Expected behavior:

  • A smooth upgrade process that takes a few minutes, up to one hour.
  • Sane memory usage after the upgrade process.

Actual behavior:

See description above.

Environment info:

  • Linux 5.4.0-1045-aws aarch64
  • InfluxDB 2.1.1 (git: 657e1839de) build_date: 2021-11-09T03:03:48Z
  • I use the *-alpine variant of the docker images.
  • The cardinality of our series is 4206 (as shown by SHOW SERIES CARDINALITY),
    which does not seem that high...

Config:

Completely standard from what I can see:

bolt-path = "/var/lib/influxdb2/influxd.bolt"
engine-path = "/var/lib/influxdb2/engine"
log-level = "warn"
storage-max-concurrent-compactions = 1
storage-series-id-set-cache-size = 100
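
Apart from those overrides everything is at its defaults. For completeness, these are the
memory-related knobs I would know to tune; none of them is set explicitly, and the values below
are just the defaults as I understand them (illustrative, not something we actually run):

storage-cache-max-memory-size = 1073741824     # per-shard write cache limit (1 GiB default)
storage-cache-snapshot-memory-size = 26214400  # cache size at which a TSM snapshot is written (25 MiB default)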

Logs:

These logs are visible after starting influxd:

influx_1             | ts=2021-11-23T13:52:59.341370Z lvl=info msg="Reindexing WAL data" log_id=0X~G2Sxl000 service=storage-engine engine=tsm1 db_shard_id=38769
influx_1             | ts=2021-11-23T13:52:59.341401Z lvl=info msg="Opened shard" log_id=0X~G2Sxl000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/var/lib/influxdb2/engine/data/ba958bde7c3ad6aa/autogen/38769 duration=159.948ms
influx_1             | ts=2021-11-23T13:52:59.397790Z lvl=info msg="index opened with 8 partitions" log_id=0X~G2Sxl000 service=storage-engine index=tsi
influx_1             | ts=2021-11-23T13:52:59.419228Z lvl=info msg="Opened file" log_id=0X~G2Sxl000 service=storage-engine engine=tsm1 service=filestore path=/var/lib/influxdb2/engine/data/ba958bde7c3ad6aa/autogen/38663/000000001-000000002.tsm id=0 duration=3.966ms
influx_1             | ts=2021-11-23T13:52:59.419500Z lvl=info msg="Reindexing TSM data" log_id=0X~G2Sxl000 service=storage-engine engine=tsm1 db_shard_id=38663
influx_1             | ts=2021-11-23T13:52:59.446757Z lvl=info msg="index opened with 8 partitions" log_id=0X~G2Sxl000 service=storage-engine index=tsi

Once restarted, one can see that the following command takes quite a bit of time before influxd starts.
The user is already set correctly, but it seems that the directory contains over one million files.
Just counting them via find took a few minutes - might that be the issue?

$ find /var/lib/influxdb2 ! -user influxdb -exec chown influxdb {} +
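
The file count mentioned above came from a plain find along these lines, nothing more elaborate:

# count all files under the data directory (this alone took a few minutes)
$ find /var/lib/influxdb2 -type f | wc -l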
@jo-me

jo-me commented Dec 8, 2021

Do you see the same when upgrading to 2.0.9?

@sahib
Contributor Author

sahib commented Dec 8, 2021

Do you see the same when upgrading to 2.0.9?

Yes, it seems like the behavior is the same as with 2.1.1: slow upgrade, massive memory usage. Did you suspect that 2.0.9 had better performance for a specific reason? Note that we also had very similar issues with 1.8.x and the tsi1 index. I think the issue is really that we hit an edge case regarding performance here, probably related to the massive number of files that are created.

@jo-me

jo-me commented Dec 9, 2021

We noticed the high (unlimited) memory usage as well after the upgrade to 2.1.1, but I guess it was a coincidence then, as we're only starting to pour data into it.

@sahib
Contributor Author

sahib commented Dec 9, 2021

FYI: We also tried the recommendations (specifically GODEBUG=madvdontneed=1) from here, to no avail. The behavior stays the same overall; it just uses slightly less memory. If the problem persists, we're going to have to switch database technology in the long term.
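
For completeness, the variable was passed via the container environment, roughly like this
(image tag and mount as in our setup above, everything else unchanged):

# run influxd with the Go runtime hint suggested in the linked recommendations
docker run -d \
  -e GODEBUG=madvdontneed=1 \
  -v /var/lib/influxdb2:/var/lib/influxdb2 \
  influxdb:2.1.1-alpine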

@lesam
Contributor

lesam commented Feb 4, 2022

@sahib there was a change present in 2.0.9 and forward (including 2.1.1) that improved TSI memory consumption: #22334 .

You may also be hitting #23085 - do you have any scrapers configured?

Finally, @sahib @jo-me would either of you be able to share profiles? You can collect a full set of profiles with something like curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all". I want to figure out what caused this for you; if you want to have a higher-bandwidth conversation, you can reach out to me on our community Slack: https://www.influxdata.com/blog/introducing-our-new-influxdata-community-slack-workspace/
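
If you want to take a first look yourself before sharing: the archive contains standard Go pprof
profiles, so after unpacking it something along these lines should list the biggest heap consumers
(the exact file name inside the archive may differ):

# unpack the collected profiles and print the top heap allocations
tar xzf profiles.tar.gz
go tool pprof -top heap.pb.gz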

@lesam lesam self-assigned this Feb 4, 2022
@sahib
Contributor Author

sahib commented Feb 5, 2022

@lesam Thanks for your response 🙏

@sahib there was a change present in 2.0.9 and forward (including 2.1.1) that improved TSI memory consumption: #22334 .

I just checked the git log, and the commit mentioned in that PR was already in the version I tested with. Or do you mean that this change might be the issue?

You may also be hitting #23085 - do you have any scrapers configured?

No, we did not have any scrapers configured.

Regarding the profiles: sorry, I can't help with that anymore. We switched to Timescale a few weeks ago and don't have any InfluxDB instances running anymore. I really hope @jo-me can be of more help here than I can. In hindsight, the thing that felt strangest was the millions of files produced in the database directory.

@lesam
Contributor

lesam commented Nov 16, 2022

No way to identify or reproduce this issue, and it may have been a duplicate of #23085 , so closing.

@lesam lesam closed this as completed Nov 16, 2022