[1.8.9]: troubles with restore + tsi1 + insane memory usage #22646

Open
sahib opened this issue Oct 11, 2021 · 3 comments

Comments

@sahib
Contributor

sahib commented Oct 11, 2021

Hello,

I have a couple of weird issues restoring a backup made on our production
instance (v1.8.6) to one of our staging environments (v1.8.9). This is an
automated process in our case and worked until 2-3 weeks ago.

Since I was hit by #21991 I updated from 1.8.6 to 1.8.9; otherwise I was not
even able to start the backup. Long term we want to upgrade to 2.0, but until
that happens we're stuck with 1.8.x for some time. The whole idea was to test
the switch to tsi1 on staging before doing it on our prod instance, hence the
backup/restore cycle investigation listed below. But the "real" issue is that
we want to save some memory on our prod instance.

I'm happy to provide more info if needed. Any idea how to proceed here?


Steps to reproduce:

  1. Take backup of prod instance using influxd backup -portable /some/dir
  2. Transfer to staging instance using rsync.
  3. Try to restore using influxd restore -portable /var/lib/influxdb/backup using various config options.

(NOTE: actual commands are slightly longer, due to dockerized env, but effectively the same)
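For reference, a minimal sketch of that cycle (the container name, staging hostname, and paths are placeholders; the dockerized invocation is an assumption, not our literal commands):

    # on the prod host (v1.8.6): take a portable backup
    docker exec influxdb influxd backup -portable /some/dir

    # transfer the backup to the staging host
    rsync -av /some/dir/ staging-host:/var/lib/influxdb/backup/

    # on the staging host (v1.8.9): restore the portable backup
    docker exec influxdb influxd restore -portable /var/lib/influxdb/backup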

Expected behavior:

Restore would work with 1.8.6 or at least 1.8.9.

Actual behavior:

It does not.

  1. 1.8.6 restore fails immediately with an error message similar to the ones in #9968 (Backup/restore fails with a lot of databases).

  2. 1.8.9 seemingly works at first, but eats huge amounts of memory (I had to increase
    the instance size to 32G + swap to progress further). After importing roughly 20G
    of data, it crashed with an out-of-memory error (see log), despite there still
    being enough memory available at that point.

  3. After setting the indexing back to "inmem" the excessive memory consumption was gone,
    but the restore still crashed halfway through (see other log).

  4. After reading up on this, I followed a few suggestions (a sketch of both
    settings follows after this list):

    Add vm.max_map_count=2048000 to /etc/sysctl.conf and activate it.
    Set "max-concurrent-compactions" to 0.

    With this setup the restore worked (in the sense that the restore command
    returned successfully), but it still produced an OOM error shortly after. After
    a restart of the influxd process the data was (mostly?) there, though. I'm not
    100% certain the two previous changes had an effect; maybe it was just "luck".
    I forgot to save that log, but it looked pretty much like the previous ones,
    except for different timestamps.

  5. When trying to restart now with tsi1 enabled, the insane memory consumption happens
    again. This seems to be a more general issue in our case.
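For clarity, here is a sketch of the settings from step 4 (file paths assume a stock Linux/InfluxDB 1.8 layout and may differ in our dockerized setup):

    # /etc/sysctl.conf
    vm.max_map_count=2048000
    # activate without a reboot:
    sysctl -p

    # /etc/influxdb/influxdb.conf, [data] section
    index-version = "tsi1"            # "inmem" for the run in step 3
    max-concurrent-compactions = 0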

In all cases from 2 to 4 I also see plenty of log lines like this:

lvl=warn msg="Error while freeing cold shard resources" service=store error="engine is closed" db_shard_id=23510

Environment info:

  • Linux 5.4.0-1029-aws x86_64
  • InfluxDB v1.8.9 (git: 1.8 d9b56321d579)
  • I use the *-alpine variant of the docker images.
  • The size of the backup is roughly 31G.
  • The cardinality of our series is 4206 (as shown by SHOW SERIES CARDINALITY),
    which does not seem that high...
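For reference, the cardinality can be checked with something like the following (the database name is a placeholder):

    influx -database mydb -execute 'SHOW SERIES CARDINALITY'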

Config:

Config is pretty much default, except the modifications described above.

sahib changed the title from "[1.8.9: troubles with restore + tsi1 + insane memory usage]" to "[1.8.9]: troubles with restore + tsi1 + insane memory usage" on Oct 11, 2021
@tomchon

tomchon commented Mar 30, 2022

I have the same error: "2022/03/30 21:41:37 Error writing: [DebugInfo: worker #0, dest url: http://test217:8086] Invalid write response (status 500): {"error":"engine is closed"}"

@tuxracer1337

Hello folks,

We can confirm: this problem persists.

After a disaster recovery, we cannot import the data from one node to another.
Influx uses up all swap/RAM, and after a while the error described above occurs.

@rosscdh

rosscdh commented Feb 9, 2024

+1, deadly bad. The only solution that worked for us was the line protocol export/import below:

influx_inspect export -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal -out /var/lib/influxdb/influx-backup/backup-${GC_ID} -compress -database metrics -retention autogen

influx -import -compressed -path /backups/${GC_ID}-influx-backup/data/influxdb/influx-backup/backup-${GC_ID}
