[1.8.9]: troubles with restore + tsi1 + insane memory usage #22646

Open
sahib opened this issue Oct 11, 2021 · 3 comments

Comments

@sahib
Contributor

sahib commented Oct 11, 2021

Hello,

I have a couple of weird issues restoring a backup made on our production
instance (v1.8.6) to one of our staging environments (v1.8.9). This is an
automated process in our case and worked until 2-3 weeks ago.

Since I was hit by #21991 I updated from 1.8.6 to 1.8.9; otherwise I was not
even able to start the backup. Long term we want to upgrade to 2.0, but until
that happens we're stuck with 1.8.x for some time. The whole idea was to test
the switch to tsi1 on staging before doing it on our prod instance, hence the
backup/restore cycle investigation listed below. But the "real" issue is that
we want to save some memory on our prod instance.

I'm happy to provide more info if needed. Any idea how to proceed here?


Steps to reproduce:

  1. Take backup of prod instance using influxd backup -portable /some/dir
  2. Transfer to staging instance using rsync.
  3. Try to restore using influxd restore -portable /var/lib/influxdb/backup using various config options.

(NOTE: actual commands are slightly longer, due to dockerized env, but effectively the same)
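For reference, a minimal sketch of that cycle (the container name, staging hostname, and paths are placeholders; the dockerized invocation is an assumption, not our literal commands):

    # on the prod host (v1.8.6): take a portable backup
    docker exec influxdb influxd backup -portable /some/dir

    # transfer the backup to the staging host
    rsync -av /some/dir/ staging-host:/var/lib/influxdb/backup/

    # on the staging host (v1.8.9): restore the portable backup
    docker exec influxdb influxd restore -portable /var/lib/influxdb/backup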

Expected behavior:

Restore would work with 1.8.6 or at least 1.8.9.

Actual behavior:

It does not.

  1. 1.8.6 restore fails immediately with an error message similar to the ones in #9968 (Backup/restore fails with a lot of databases).

  2. 1.8.9 seemingly works at first, but eats huge amounts of memory (I had to increase
    the instance size to 32G + swap to progress further). After importing roughly 20G
    of data, it crashed with an out-of-memory error (see log), despite there still
    being enough memory available at that point.

  3. After setting the indexing back to "inmem" the excessive memory consumption was gone,
    but the restore still crashed halfway through (see other log).

  4. After reading up on this, I followed a few suggestions (a sketch of both
    settings follows after this list):

    Add vm.max_map_count=2048000 to /etc/sysctl.conf and activate it.
    Set "max-concurrent-compactions" to 0.

    With this setup the restore worked (in the sense that the restore command
    returned successfully), but it still produced an OOM error shortly after. After
    a restart of the influxd process the data was (mostly?) there, though. I'm not
    100% certain the two previous changes had an effect; maybe it was just "luck".
    I forgot to save that log, but it looked pretty much like the previous ones,
    except for different timestamps.

  5. When trying to restart now with tsi1 enabled, the insane memory consumption happens
    again. This seems to be a more general issue in our case.
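For clarity, here is a sketch of the settings from step 4 (file paths assume a stock Linux/InfluxDB 1.8 layout and may differ in our dockerized setup):

    # /etc/sysctl.conf
    vm.max_map_count=2048000
    # activate without a reboot:
    sysctl -p

    # /etc/influxdb/influxdb.conf, [data] section
    index-version = "tsi1"            # "inmem" for the run in step 3
    max-concurrent-compactions = 0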

In all cases from 2 to 4 I also see plenty of log lines like this:

lvl=warn msg="Error while freeing cold shard resources" service=store error="engine is closed" db_shard_id=23510

Environment info:

  • Linux 5.4.0-1029-aws x86_64
  • InfluxDB v1.8.9 (git: 1.8 d9b56321d579)
  • I use the *-alpine variant of the docker images.
  • The size of the backup is roughly 31G.
  • The cardinality of our series is 4206 (as shown by SHOW SERIES CARDINALITY),
    which does not seem that high...
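For reference, the cardinality can be checked with something like the following (the database name is a placeholder):

    influx -database mydb -execute 'SHOW SERIES CARDINALITY'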

Config:

Config is pretty much default, except the modifications described above.

sahib changed the title from "[1.8.9: troubles with restore + tsi1 + insane memory usage]" to "[1.8.9]: troubles with restore + tsi1 + insane memory usage" on Oct 11, 2021
@tomchon

tomchon commented Mar 30, 2022

I have the same error: "2022/03/30 21:41:37 Error writing: [DebugInfo: worker #0, dest url: http://test217:8086] Invalid write response (status 500): {"error":"engine is closed"}"

@tuxracer1337

Hello folks,

We can confirm: this problem persists.

After a disaster recovery, we cannot import the data from one node to another.
Influx uses up all swap/RAM, and after a while the error described above occurs.

@rosscdh

rosscdh commented Feb 9, 2024

+1, deadly bad. The only solution that worked for us was the line protocol export/import below:

influx_inspect export -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal -out /var/lib/influxdb/influx-backup/backup-${GC_ID} -compress -database metrics -retention autogen

influx -import -compressed -path /backups/${GC_ID}-influx-backup/data/influxdb/influx-backup/backup-${GC_ID}
