Backup/restore fails with a lot of databases #9968

Closed
agoetschm opened this issue Jun 13, 2018 · 48 comments · Fixed by #17495
Comments

@agoetschm

Bug report

System info: InfluxDB v1.5.3, installed from brew on Mac OS X 10.12.6

Steps to reproduce:

  1. Start a clean instance of InfluxDB
rm -r .influxdb
influxd
  2. Create some dummy databases
curl -X POST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE test"
curl -X POST http://localhost:8086/write?db=test --data-binary "a i=1"
curl -X POST http://localhost:8086/query --data-urlencode "q=$(perl dummy_data.pl 1 500)"

where dummy_data.pl is

use 5.010;
use strict;
use warnings;
for my $i ($ARGV[0]..$ARGV[1]) {
    my $db = "test$i";
    say "CREATE DATABASE $db WITH DURATION 260w REPLICATION 1 SHARD DURATION 12w NAME rp2;";
    say "CREATE RETENTION POLICY rp1 ON $db DURATION 100d REPLICATION 1 SHARD DURATION 2w;";
    say "CREATE CONTINUOUS QUERY cq1 ON $db RESAMPLE EVERY 5m FOR 10m BEGIN SELECT LAST(a) AS b, c INTO $db.rp2.m FROM $db.rp1.m GROUP BY time(5m) END;";
    say "CREATE CONTINUOUS QUERY cq2 ON $db RESAMPLE EVERY 5m FOR 10m BEGIN SELECT MAX(a) AS b, c INTO $db.rp2.m FROM $db.rp1.m GROUP BY time(5m) END;";
}
  3. Back up everything
rm -r ./backup
influxd backup -portable ./backup
  4. Try to restore the database test
influxd restore -portable -db test -newdb test_bak backup/

Expected behavior: The database test is restored as test_bak

Actual behavior: Restoring the database fails (most of the time...) with the message "error updating meta: DB metadata not changed. database may already exist", even though test_bak does not exist.
I wasn't able to understand the resulting log line, where RetentionPolicyInfo isn't always the same:

failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0)

Additional info: This behaviour seems to depend on the amount of metadata. If I add only 100 dummy databases instead of 500 (curl -X POST http://localhost:8086/query --data-urlencode "q=$(perl dummy_data.pl 1 100)"), everything works well.

Me trying to restore a few times, where the 6th attempt worked:

➜  ~ rm -r ./backup
influxd backup -portable ./backup
2018/06/13 10:27:37 backing up metastore to backup/meta.00
2018/06/13 10:27:37 No database, retention policy or shard ID given. Full meta store backed up.
2018/06/13 10:27:37 Backing up all databases in portable format
2018/06/13 10:27:37 backing up db=
2018/06/13 10:27:37 backing up db=test rp=autogen shard=1 to backup/test.autogen.00001.00 since 0001-01-01T00:00:00Z
2018/06/13 10:27:37 backing up db=_internal rp=monitor shard=2 to backup/_internal.monitor.00002.00 since 0001-01-01T00:00:00Z
2018/06/13 10:27:37 backup complete:
2018/06/13 10:27:37 	backup/20180613T082737Z.meta
2018/06/13 10:27:37 	backup/20180613T082737Z.s1.tar.gz
2018/06/13 10:27:37 	backup/20180613T082737Z.s2.tar.gz
2018/06/13 10:27:37 	backup/20180613T082737Z.manifest
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:45 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:52 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:53 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:54 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:54 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:55 Restoring shard 1 live from backup 20180613T082737Z.s1.tar.gz
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:57 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:58 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist

The corresponding logs:

2018-06-13T08:27:37.023239Z	info	Cache snapshot (start)	{"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4kl000", "op_name": "tsm1_cache_snapshot", "op_event": "start"}
2018-06-13T08:27:37.026848Z	info	Snapshot for path written	{"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4kl000", "op_name": "tsm1_cache_snapshot", "path": "/Users/ang/.influxdb/data/test/autogen/1", "duration": "3.621ms"}
2018-06-13T08:27:37.026885Z	info	Cache snapshot (end)	{"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4kl000", "op_name": "tsm1_cache_snapshot", "op_event": "end", "op_elapsed": "3.657ms"}
2018-06-13T08:27:37.031269Z	info	Cache snapshot (start)	{"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4ml000", "op_name": "tsm1_cache_snapshot", "op_event": "start"}
2018-06-13T08:27:37.033460Z	info	Snapshot for path written	{"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4ml000", "op_name": "tsm1_cache_snapshot", "path": "/Users/ang/.influxdb/data/_internal/monitor/2", "duration": "2.198ms"}
2018-06-13T08:27:37.033493Z	info	Cache snapshot (end)	{"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4ml000", "op_name": "tsm1_cache_snapshot", "op_event": "end", "op_elapsed": "2.230ms"}
2018-06-13T08:27:45.624373Z	info	failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:52.234943Z	info	failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:53.457241Z	info	failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:54.170693Z	info	failed to decode meta: proto: meta.DatabaseInfo: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:54.841937Z	info	failed to decode meta: proto: meta.Data: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:55.620080Z	info	Opened file	{"log_id": "08f3wpxW000", "engine": "tsm1", "service": "filestore", "path": "/Users/ang/.influxdb/data/test_bak/autogen/3/000000001-000000001.tsm", "id": 0, "duration": "0.158ms"}
2018-06-13T08:27:57.340738Z	info	failed to decode meta: proto: meta.Data: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:58.570292Z	info	failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}
@agoetschm changed the title from "Backup/restore fails when there are too many databases" to "Backup/restore fails with a lot of databases" on Jun 13, 2018
@nimdanitro

I can confirm this behavior. I'm experiencing exactly the same with v1.5.2 and v1.5.4.

@galindro

I can confirm this too on v1.6.4. Is there a workaround?

@simztypeuk

simztypeuk commented Dec 31, 2018

Any word on a fix for this? The issue is still present in v1.7.2.

Edit: Just to say, the original database was v1.3.6, recently upgraded to v1.7.2, with indexes rebuilt to TSI1 immediately afterwards; it has been operating fine since then. Backups with -portable appear okay, but the first time I attempted a restore into a separate instance of InfluxDB (with 7 other databases already present, but not the one being restored), it failed immediately with the "illegal tag 0 (wire type 0)" error.

@dgnorton added the 1.x label on Jan 7, 2019
@bitfisher

bitfisher commented Mar 6, 2019

Issue is still present in v1.7.4

Restoring a legacy backup with "-online" also fails with the same error:
failed to decode meta: proto: meta.DatabaseInfo: illegal tag 0 (wire type 0)

Legacy backup and legacy restore (using -metadir and -datadir) work, but it seems that some data is not restored correctly.
There were no warnings or errors while restoring.

@hongquan

hongquan commented Mar 8, 2019

I found that restore works on a real machine (laptop), but not on any server with a virtual disk.
I tried VPSes from DigitalOcean, Azure and some Vietnamese providers (vHost, Vinahost, VCCloud, Vietnix). I also tried a bare-metal server from Scaleway, which comes with a network disk. All failed to restore the InfluxDB database (portable mode).

Log from client:

$ influxd restore -portable -db daothanh /tmp/db/ts_daothanh
2019/03/08 08:12:44 error updating meta: updating metadata on influxd service failed: err=read tcp 127.0.0.1:54174->127.0.0.1:8088: read: connection reset by peer, n=16
restore: updating metadata on influxd service failed: err=read tcp 127.0.0.1:54174->127.0.0.1:8088: read: connection reset by peer, n=16

Log from server:

$ journalctl -u influxdb

...
Mar 08 08:12:44 db-ams influxd[6119]: ts=2019-03-08T08:12:44.044335Z lvl=info msg="failed to decode meta: proto: meta.ShardGroupInfo: illegal ta
...

@alexferl

alexferl commented Jul 9, 2019

Same issue as @hongquan on Google Cloud's Kubernetes Engine on 1.7.7. Why is restoring on Docker/Kubernetes so hard? Even this new method requires a lot of setup for something that should be easy.

@synatree

I'm trying to restore just one database; the total size of the backup is about 3.2G (portable version) or 4.9G (legacy version).

The intended restore server is empty, a brand new install. I've tried both the -host transport option and a plain old rsync of the backup directory; both result in the same error as noted above.

@ebarault

ebarault commented Sep 5, 2019

Restoring remotely to InfluxDB 1.7.8, running in Docker.

influxd CLI error message:

2019/09/05 08:52:08 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist

InfluxDB log error message:

ts=2019-09-05T08:48:59.059296Z lvl=info msg="failed to decode meta: proto: meta.ShardGroupInfo: illegal tag 0 (wire type 0)" log_id=0Hh9BaOG000 service=snapshot

UPDATE_1: according to this comment, it might be related to Docker networking; restore works when executed inside the container, not remotely as I do.

UPDATE_2: (docker/network_mode: host) running the restore from within the InfluxDB Docker container AND from the Docker host (not inside the container) worked, while it fails when run from a remote host.

UPDATE_3: (docker/network_mode: bridge) running the restore from the Docker host failed, while it worked when run from within the container.

"SOLUTION": I tested a number of different combinations of remote hosts and influxd versions, and restore always failed when run remotely, so I ended up scripting an Ansible-driven restore that runs locally on the container host. When using Docker network_mode bridge I need to run the restore command inside the container.

cc: @dgnorton, have you seen anything like this before?

@otolizz

otolizz commented Oct 21, 2019

Hello,
I have the same issue here.
I can't even back up / restore on the same instance.

I run standalone InfluxDB 1.7.8 (I have different instances; some of them have only 1 DB, others could have a hundred DBs, but all are on 1.7.8).
I can make a portable backup, but when I want to restore it (on the same instance), I always get:
2019/10/21 13:53:57 error updating meta: updating metadata on influxd service failed: err=read tcp 127.0.0.1:49622->127.0.0.1:8088: read: connection reset by peer, n=16
restore: updating metadata on influxd service failed: err=read tcp 127.0.0.1:49622->127.0.0.1:8088: read: connection reset by peer, n=16

In the influxdb log:
influxd[18547]: ts=2019-10-21T10:51:26.281478Z lvl=info msg="failed to decode meta: proto: meta.ShardGroupInfo: illegal tag 0 (wire type 0)" log_id=0IcUppUW00

I don't even know what to do to manage incremental backup / restore now... :(

@onstring

onstring commented Nov 1, 2019

We ran into the same issue when restoring InfluxDB.

  1. Backup taken on InfluxDB v1.7.6 and restored on an empty host running the same InfluxDB version.
  2. The restore host is empty, so it does not have any existing DBs or retention policies.
  3. We copied over the backup folder when doing the restore, so it does not relate to the local/remote issue someone mentioned. FYI, we did not use Docker.
  4. We tried backing up all the databases as well as a single database; restoring fails with the same message:

Nov 01 16:01:36 xxxxxx influxd[28340]: ts=2019-11-01T05:01:36.659436Z lvl=info msg="failed to decode meta: proto: meta.ShardGroupInfo: illegal tag 0 (wire type 0)" log_id=0IqLGmyW000 service=snapshot

Update: I've tried the legacy offline backup and restore as well. The restoration works, but unfortunately the restore or the backup is not complete; some data points are missing. (We visualized and compared the original and restored InfluxDB data.)

Update 2: We tried higher-performance servers for the restore; it worked then and we didn't see the error again.

@cantino

cantino commented Nov 19, 2019

This seems pretty critical. Is this prioritized?

@dgnorton added this to the 1.8.0 milestone on Nov 19, 2019
@sorrison

We had this issue, and we found that using a faster server (more cores, RAM and IOPS) made the restore work.

@superbool

Same problem. Backup on 1.7.8 and restore on 1.7.9. ERROR: failed to decode meta: proto: meta.ShardGroupInfo: illegal tag 0 (wire type 0). Restoring to 1.7.8 gives the same error too.

@ebarault

@superbool please read my comment: #9968 (comment)

@superbool

Same problem. Backup on 1.7.8 and restore on 1.7.9. ERROR: failed to decode meta: proto: meta.ShardGroupInfo: illegal tag 0 (wire type 0). Restoring to 1.7.8 gives the same error too.

I solved the problem by updating the system, but I don't know which software made the difference.

sudo yum update
sudo reboot

@simztypeuk

Same problem. Backup on 1.7.8 and restore on 1.7.9. ERROR: failed to decode meta: proto: meta.ShardGroupInfo: illegal tag 0 (wire type 0). Restoring to 1.7.8 gives the same error too.

I solved the problem by updating the system, but I don't know which software made the difference.

sudo yum update
sudo reboot

Supercool Superbool :)

We were previously seeing failures on 1.7.8 running over Amazon Linux 2. Just applied updates and bumped to 1.7.9, and hey presto, portable backup from one of our AWS servers has just imported into another server running the same software levels.

Happy days!

@alexferl

This is still happening on 1.7.9 on Kubernetes. Even on a ridiculous 96 vCPU 256GB RAM VM trying to restore a measly 43GB dump.

/var/lib/influxdb # influxd restore -portable influx_dump
2020/01/11 03:37:07 error updating meta: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44306->127.0.0.1:8088: read: connection reset by peer, n=16
restore: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44306->127.0.0.1:8088: read: connection reset by peer, n=16
/var/lib/influxdb # influxd restore -portable influx_dump
2020/01/11 03:39:50 error updating meta: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44518->127.0.0.1:8088: read: connection reset by peer, n=16
restore: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44518->127.0.0.1:8088: read: connection reset by peer, n=16
/var/lib/influxdb # influxd restore -portable influx_dump
2020/01/11 03:39:54 error updating meta: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44530->127.0.0.1:8088: read: connection reset by peer, n=16
restore: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44530->127.0.0.1:8088: read: connection reset by peer, n=16
/var/lib/influxdb # influxd restore -portable influx_dump
2020/01/11 03:40:00 error updating meta: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44536->127.0.0.1:8088: read: connection reset by peer, n=16
restore: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44536->127.0.0.1:8088: read: connection reset by peer, n=16
/var/lib/influxdb # influxd restore -portable influx_dump
2020/01/11 03:40:04 error updating meta: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44544->127.0.0.1:8088: read: connection reset by peer, n=16
restore: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44544->127.0.0.1:8088: read: connection reset by peer, n=16
/var/lib/influxdb # influxd restore -portable influx_dump
2020/01/11 03:40:09 error updating meta: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44546->127.0.0.1:8088: read: connection reset by peer, n=16
restore: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44546->127.0.0.1:8088: read: connection reset by peer, n=16
/var/lib/influxdb # influxd restore -portable influx_dump
2020/01/11 03:40:13 error updating meta: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44554->127.0.0.1:8088: read: connection reset by peer, n=16
restore: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44554->127.0.0.1:8088: read: connection reset by peer, n=16
/var/lib/influxdb # influxd restore -portable influx_dump
2020/01/11 03:40:18 error updating meta: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44560->127.0.0.1:8088: read: connection reset by peer, n=16
restore: updating metadata on influxd service failed: err=read tcp 127.0.0.1:44560->127.0.0.1:8088: read: connection reset by peer, n=16
ts=2020-01-11T03:37:07.469698Z lvl=info msg="failed to decode meta: proto: meta.ShardInfo: illegal tag 0 (wire type 0)" log_id=0KGdIo5G000 service=snapshot
ts=2020-01-11T03:39:50.536684Z lvl=info msg="failed to decode meta: proto: meta.DatabaseInfo: illegal tag 0 (wire type 0)" log_id=0KGdIo5G000 service=snapshot
ts=2020-01-11T03:39:54.912413Z lvl=info msg="failed to decode meta: proto: meta.DatabaseInfo: illegal tag 0 (wire type 0)" log_id=0KGdIo5G000 service=snapshot
ts=2020-01-11T03:40:00.485891Z lvl=info msg="failed to decode meta: proto: meta.ShardInfo: illegal tag 0 (wire type 0)" log_id=0KGdIo5G000 service=snapshot
ts=2020-01-11T03:40:04.951187Z lvl=info msg="failed to decode meta: proto: meta.ShardInfo: illegal tag 0 (wire type 0)" log_id=0KGdIo5G000 service=snapshot
ts=2020-01-11T03:40:09.317410Z lvl=info msg="failed to decode meta: proto: meta.ShardInfo: illegal tag 0 (wire type 0)" log_id=0KGdIo5G000 service=snapshot
ts=2020-01-11T03:40:13.708552Z lvl=info msg="failed to decode meta: proto: meta.ShardInfo: illegal tag 0 (wire type 0)" log_id=0KGdIo5G000 service=snapshot
ts=2020-01-11T03:40:18.090725Z lvl=info msg="failed to decode meta: proto: meta.ShardInfo: illegal tag 0 (wire type 0)" log_id=0KGdIo5G000 service=snapshot

@code-mechine

My InfluxDB version is 1.7.6.
I found the problem in the file influxdata/influxdb/services/snapshotter/service.go:
func (s *Service) readRequest(conn net.Conn) (Request, []byte, error) {
	var r Request
	d := json.NewDecoder(conn)

	if err := d.Decode(&r); err != nil {
		return r, nil, err
	}

	bits := make([]byte, r.UploadSize+1)

	if r.UploadSize > 0 {

		remainder := d.Buffered()

		n, err := remainder.Read(bits)
		if err != nil && err != io.EOF {
			return r, bits, err
		}

		// it is a bit random but sometimes the Json decoder will consume all the bytes and sometimes
		// it will leave a few behind.
		if err != io.EOF && n < int(r.UploadSize+1) {
			_, err = conn.Read(bits[n:])
		}

		if err != nil && err != io.EOF {
			return r, bits, err
		}
		// the JSON encoder on the client side seems to write an extra byte, so trim that off the front.
		return r, bits[1:], nil
	}

	return r, bits, nil
}
This function reads the contents of the metadata file from the TCP connection. When the file is large it arrives in several chunks, but this code performs at most two reads, so when the metadata file is too large the server never receives the complete content. It should be improved to keep reading until all the data sent by the client has been received.

I have modified and compiled this part of the code in my environment. After the modification there is no longer any "proto: meta.Data: illegal tag 0 (wire type 0)" problem, and the restore command executes successfully.
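
For illustration, here is a minimal sketch of the "read the whole payload" behaviour described above, using io.ReadFull so the server keeps reading until the full UploadSize (plus the extra leading byte) has arrived. This is only a sketch, not the patch that was eventually merged in #17495; the Request type is stubbed out and only its UploadSize field is assumed:

package sketch

import (
	"encoding/json"
	"io"
	"net"
)

// Request is stubbed out for this sketch; only UploadSize is assumed here.
type Request struct {
	UploadSize int64
}

// readRequestFull decodes the JSON request header, then reads the complete
// upload payload, however many TCP reads that takes.
func readRequestFull(conn net.Conn) (Request, []byte, error) {
	var r Request
	d := json.NewDecoder(conn)
	if err := d.Decode(&r); err != nil {
		return r, nil, err
	}
	if r.UploadSize <= 0 {
		return r, nil, nil
	}
	// The JSON decoder may already have buffered part of the payload,
	// so drain that first, then continue reading from the socket.
	body := io.MultiReader(d.Buffered(), conn)
	// +1 for the extra byte the client-side JSON encoder writes before the payload.
	bits := make([]byte, r.UploadSize+1)
	if _, err := io.ReadFull(body, bits); err != nil {
		return r, nil, err
	}
	return r, bits[1:], nil
}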

@code-mechine


And my solution is:
func (s *Service) readRequest(conn net.Conn) (Request, []byte, error) {
	var r Request
	d := json.NewDecoder(conn)

	if err := d.Decode(&r); err != nil {
		return r, nil, err
	}
	var buffer bytes.Buffer
	if r.UploadSize > 0 {
		bits := make([]byte, r.UploadSize+1)
		remainder := d.Buffered()
		n, err := remainder.Read(bits)
		if err != nil && err != io.EOF {
			return r, bits, err
		}
		fmt.Println("remainder num: ", n)
		buffer.Write(bits[0:n])
		// Set the timeout according to the actual situation
		_ = conn.SetReadDeadline(time.Now().Add(20 * time.Second))
		for {
			//bs := make([]byte, r.UploadSize-int64(n+rn))
			nu, err := conn.Read(bits)
			if err != nil && err != io.EOF {
				return r, buffer.Bytes(), err
			}
			if err != io.EOF && n < int(r.UploadSize+1) {
				buffer.Write(bits[0:nu])
				n += nu
				if n >= int(r.UploadSize) {
					// upStream receiving completed
					break
				}
				continue
			}
		}
		// the JSON encoder on the client side seems to write an extra byte, so trim that off the front.
		return r, buffer.Bytes()[1:], nil
	}
	return r, buffer.Bytes(), nil
}
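
(For reference: the loop above keeps appending each conn.Read result into a bytes.Buffer until UploadSize bytes have arrived, which is the essential change; a single conn.Read only returns whatever the kernel happens to have buffered at that moment, so a large metadata payload spans several reads. The io.ReadFull sketch earlier in this thread expresses the same "read exactly N bytes" contract more compactly.)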

code-mechine pushed a commit to code-mechine/influxdb that referenced this issue Mar 13, 2020
…file is too large, the influxdb server cannot fully receive the meta file sent by the client. issue link: influxdata#9968 (comment)
@ayang64 self-assigned this on Mar 30, 2020
@ayang64
Contributor

ayang64 commented Mar 31, 2020

I submitted the patch above. Anyone interested, please review so we can get this merged as quickly as possible.

@trzad

trzad commented Apr 21, 2020

@ebarault:

Did you try adjusting the TCP window size, as in the example below (to 8 MB)?

echo 'net.core.wmem_max=4194304' >> /etc/sysctl.conf
echo 'net.core.rmem_max=12582912' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 4096 87380 4194304' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_wmem = 4096 87380 4194304' >> /etc/sysctl.conf
sysctl -p

It worked for me.

@roman-vynar

Same problem on versions 1.8.0 and 1.8.2; haven't tried 1.8.1.

@roman-vynar

The fix from PR #17495 was merged into master-1.x on 31 Mar.
If I rebuild influxd from master-1.x, it works.

So the fix was not released with 1.8.0 (Jun), 1.8.1 (Jul), or 1.8.2 (Aug) 😐
Why? People can't restore backups...
However, I have to say some backups do restore OK, but some don't, due to this issue.

@KoosBusters

Is there any indication when this will be fixed?
We also cannot restore our backups. I tried both on Windows and Linux (v1.7.6 to v1.8.2).
Is there any good documentation on how to rebuild influxd from master-1.x?

@alexferl

alexferl commented Oct 9, 2020

Since this is still not released, I made my own build + Docker image that I use for restores only. Do not use this for anything else. It works on Kubernetes as well.

Here's how I did it:

Building influxd (requires Go installed on your computer):

git clone https://github.com/influxdata/influxdb.git
cd influxdb
git checkout master-1.x
go build ./cmd/influxd 
or if you aren't on Linux: 
env GOOS=linux GOARCH=amd64 go build ./cmd/influxd

Building docker image:

git clone https://github.com/influxdata/influxdata-docker.git
cd influxdata-docker/1.8/alpine
cp <influxdb_build_path>/influxd .

Modify the Dockerfile to this:

FROM alpine:3.12

RUN echo 'hosts: files dns' >> /etc/nsswitch.conf
RUN apk add --no-cache tzdata bash ca-certificates && \
    update-ca-certificates

COPY influxd /usr/bin/influxd
COPY influxdb.conf /etc/influxdb/influxdb.conf

EXPOSE 8086

VOLUME /var/lib/influxdb

COPY entrypoint.sh /entrypoint.sh
COPY init-influxdb.sh /init-influxdb.sh
ENTRYPOINT ["/entrypoint.sh"]
CMD ["influxd"]
docker build -t <some_tag> .

@sj14

sj14 commented Nov 4, 2020

I just had to move a database and had the same issues as described here. The easiest solution for me was to use the legacy backup approach, without the -portable flag.

@laradji

laradji commented Nov 13, 2020

@alexferl Thanks, it works with the 1.x branch. You need to build the binary with this command to make it static:

env CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags '-extldflags "-static"' ./cmd/influxd

@roman-vynar

roman-vynar commented Dec 18, 2020

Again, the fix is not included in 1.8.3 (Sep).

While influxdata doesn't care, here is an all-in-one Dockerfile to build a new release from master-1.x:

FROM golang:1.15.5-alpine3.12 as builder

RUN set -ex && \
    apk update && \
    apk add ca-certificates git bash gcc musl-dev && \
    git config --global http.https://gopkg.in.followRedirects true && \
    git clone --depth 1 --branch master-1.x https://github.com/influxdata/influxdb.git /opt/ && \
    cd /opt && \
    go build ./cmd/influxd && \
    chmod +x influxd

RUN git clone --depth 1 --branch master https://github.com/influxdata/influxdata-docker.git /opt2


FROM alpine:3.12

RUN echo 'hosts: files dns' >> /etc/nsswitch.conf
RUN set -ex && \
    apk add --no-cache tzdata bash ca-certificates && \
    update-ca-certificates

COPY --from=builder /opt/influxd /usr/bin/influxd
COPY --from=builder /opt2/influxdb/1.8/alpine/influxdb.conf /etc/influxdb/influxdb.conf

EXPOSE 8086

VOLUME /var/lib/influxdb

COPY --from=builder /opt2/influxdb/1.8/alpine/entrypoint.sh /entrypoint.sh
COPY --from=builder /opt2/influxdb/1.8/alpine/init-influxdb.sh /init-influxdb.sh
ENTRYPOINT ["/entrypoint.sh"]
CMD ["influxd"]

docker build -t yourimage:1.8.x .

@teu

teu commented Feb 19, 2021

@roman-vynar I've tried it and this does not work for me.

@roman-vynar

@teu what exactly doesn't work, the docker build or the backup tool?

@teu

teu commented Feb 21, 2021

@roman-vynar I've modified the Dockerfile a bit, perhaps I've made a mistake:

FROM golang:1.15.5-alpine3.12 as builder

RUN set -ex && \
    apk update && \
    apk add ca-certificates git bash gcc musl-dev && \
    git config --global http.https://gopkg.in.followRedirects true && \
    git clone --depth 1 --branch master-1.x https://github.com/influxdata/influxdb.git /opt/ && \
    cd /opt && \
    go build ./cmd/influxd && \
    chmod +x influxd

#RUN git clone --depth 1 --branch master https://github.com/influxdata/influxdata-docker.git /opt2

FROM influxdb:1.8.3

COPY --from=builder /opt/influxd /usr/bin/influx*
RUN chmod a+x /usr/bin/influxd

RUN apt-get update -yqq \
    && apt-get install -yqq --no-install-recommends unzip python gnupg cron vim curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*

RUN cd ~/ \
&& curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \
&& unzip awscliv2.zip \
&& ./aws/install \
&& aws --version

ADD ./entrypoint.sh /entrypoint.sh
ADD ./etc/influxdb.conf /etc/influxdb/influxdb.conf
RUN  chmod a+x /entrypoint.sh && \
    chmod a+rwx -R /etc/influxdb/

ENTRYPOINT ["/entrypoint.sh"]
CMD ["influxd"]

I am doing a backup on v1.7.9 on our production env, then trying to restore it with this image, via remote restore over port 8088. Still getting the issue with metadata.

@teu

teu commented Feb 22, 2021

@roman-vynar Disregard my last message. The Dockerfile was mixing Alpine and Ubuntu images (which obviously won't work). I've built influxd and copied it onto an image. I've tried every combination, still the same issue:

influxd restore -portable -host influxdb.default.svc.cluster.local:8088 /backup -database customer_db                        
influxdb-backup-cron-restore-6-8vgqs:2021/02/22 09:44:52 error updating meta: updating metadata on influxd service failed: err=read tcp 10.1.10.184:56716->172.20.98.147:8088: read: connection reset by peer, n=16
influxdb-backup-cron-restore-6-8vgqs:restore: updating metadata on influxd service failed: err=read tcp 10.1.10.184:56716->172.20.98.147:8088: read: connection reset by peer, n=16

Note that my backups are being made on 1.7.11 from a tag, not master.

@AntonBiryukovUofC

Getting this crap actively on a CircleCI machine executor. It does not happen when a job is re-run with SSH, but happens on roughly 80% of regular commits.

It also started happening after migrating from the Ubuntu 16.04 image to 20.04. I wonder if that could be the culprit...

@palmamartin

We just ran into this issue. It would be nice if the fix #17495 got backported to 1.8.x.

@rogierslag

We had the same problem. Luckily the original server was still available, so exporting via line protocol worked. However, it took over a week to import everything, and it would have been impossible if the original server had crashed or been deleted.

For us, not being able to restore databases reliably was the final straw in deciding to move away from Influx. Lots of valuable business data could have been lost because of this.

@roman-vynar

For us, not being able to restore databases reliably was the final straw in deciding to move away from Influx. Lots of valuable business data could have been lost because of this.

Same here. We are moving away from Influx, which lags well behind its competitors.

@efidi

efidi commented Mar 15, 2021

@roman-vynar @gusutabopb
I'm having problems building influxd from source. I've tried both ways (a laptop with Debian 10 and roman-vynar's Dockerfile) but I always encounter problems with the flux package:

Package flux was not found in the pkg-config search path. Perhaps you should add the directory containing 'flux.pc' to the PKG_CONFIG_PATH environment variable Package flux not found pkg-config: exit status 1

Am I missing something?

UPDATE: Even though the problem stated at the beginning of this post still exists, I managed to bypass it by checking out an older version of the master-1.x branch.

Thank you very much for providing this solution that allowed me to restore a bunch of databases that otherwise would have been lost!

@haniha

haniha commented May 8, 2021

the problem still exists on 1.8.5 (?)

@roman-vynar

roman-vynar commented May 8, 2021

the problem still exists on 1.8.5 (?)

Yes, tried yesterday.

@haniha

haniha commented May 8, 2021

the problem still exists on 1.8.5 (?)

Yes, tried yesterday.

Same here, just asking to be sure.
didn't this commit solve the problem?

@miettal

miettal commented May 10, 2021

It seems #17495 will be merged in the next release, v1.9.0.

https://github.com/influxdata/influxdb/blob/b26a2f7a0e41349938cec592a2abac4d93c9ab1c/CHANGELOG.md
#17495: fix(snapshotter): properly read payload

@roman-vynar

roman-vynar commented May 10, 2021

didn't this commit solve the problem?

Yes, it helps.

It seems #17495 will be merged in the next release, v1.9.0.

@lesam
Contributor

lesam commented Sep 9, 2021

This should be fixed by a combination of #21991 (in 1.8.9) and #17495 (in 1.8.10).

I was able to reproduce this with #9968 (comment) on some tries with v1.8.0.

I was not able to reproduce it on the latest 1.8 including #22427 (coming in 1.8.10); I ran a script to run the repro 20x. Will close this when the 1.8 backport for #22427 closes.

@lesam
Contributor

lesam commented Sep 9, 2021

#22427 is merged - closing.
