
Backup failure: Download shard n failed copy backup to file #16739

Closed
gholmes opened this issue Feb 5, 2020 · 28 comments · Fixed by #22703 or #22732

Comments

@gholmes

gholmes commented Feb 5, 2020

Steps to reproduce:
Execute a backup command like this:
influxd backup -database mydb -portable -since 2018-01-01T00:00:00Z backup_$(date +%Y%m%d_%H%M%S)

Expected behavior:
Get a backup written to files.

Actual behavior:
Fails to generate a backup. Console output is as follows.

2020/02/05 14:40:47 backing up db=mydb
2020/02/05 14:40:47 backing up db=mydb rp=my_rp shard=122 to backup_20200205_144047/mydb.my_rp.00122.00 since 2018-01-01T00:00:00Z
2020/02/05 14:40:47 Download shard 122 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (0)...
2020/02/05 14:40:49 Download shard 122 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (1)...
2020/02/05 14:40:52 Download shard 122 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (2)...
2020/02/05 14:40:54 Download shard 122 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (3)...
2020/02/05 14:40:56 Download shard 122 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (4)...
2020/02/05 14:40:58 Download shard 122 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (5)...
2020/02/05 14:41:01 Download shard 122 failed copy backup to file: err=<nil>, n=0.  Waiting 3.01s and retrying (6)...
2020/02/05 14:41:04 Download shard 122 failed copy backup to file: err=<nil>, n=0.  Waiting 11.441s and retrying (7)...
2020/02/05 14:41:15 Download shard 122 failed copy backup to file: err=<nil>, n=0.  Waiting 43.477s and retrying (8)...
2020/02/05 14:41:59 Download shard 122 failed copy backup to file: err=<nil>, n=0.  Waiting 2m45.216s and retrying (9)...
2020/02/05 14:44:44 error (copy backup to file: err=<nil>, n=0) when backing up db: mydb, rp my_rp, shard 122. continuing backup on remaining shards
2020/02/05 14:44:44 backup failed: copy backup to file: err=<nil>, n=0
backup: copy backup to file: err=<nil>, n=0 

Only once have I gotten a backup, and that was when the database was newly-created.

  • The directory that contains the shard noted in the output above contained hundreds of empty temp directories. Deleting those directories didn't change anything with the backup failure.
  • Of the 132 shards, one has 8 TSM files; the problem shard and all of the others have one TSM file each.
  • Running the influx_inspect verify and inspect commands shows no issues (an example invocation is shown below).
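An example verification run, assuming the default data location (the path is illustrative only):

influx_inspect verify -dir /var/lib/influxdb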

Environment info:

  • System info: Linux 4.15.0-1066-azure x86_64
  • InfluxDB version: InfluxDB v1.7.0 (git: 1.7 dac4c6f571662c63dc0d73346787b8c7f113222a)
  • Other relevant environment details:
    Influx instance is running on a pod in an AKS cluster. Database files are on a mounted Azure file share.

Config:
Non-default configs are
reporting-disabled = true
bind-address = "0.0.0.0:8088"
auth-enabled = true

@russorat
Contributor

@gholmes thanks for opening this. Have you tried with the latest InfluxDB, 1.7.9? Also, where are you running the backup command? Is it from inside your k8s cluster or remote?

@gholmes
Author

gholmes commented Feb 14, 2020

Yes, I tried with 1.7.9 and got the same results. I ran the backup command locally, from another container in the cluster, and from a host outside the k8s cluster. I did learn that part of the problem is that I was mounting Azure file shares, which don't support hard links. Failures creating hard links leave the temp directories to accumulate, which seems to be at least part of the issue with backing up. I have since switched to mounting Azure Disk and I'm now able to run a backup and the temp directories don't accumulate. I think the problem that needs to be addressed is that the temp directories aren't cleaned up after the failure to create a hard link. I'm also wondering why the hard links are created at all.
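A quick way to confirm whether a mount supports hard links is to attempt one by hand, keeping both paths on the same mount (the paths below are only an example):

touch /var/lib/influxdb/data/linktest
ln /var/lib/influxdb/data/linktest /var/lib/influxdb/data/linktest.hl
rm /var/lib/influxdb/data/linktest /var/lib/influxdb/data/linktest.hl

On an Azure Files (SMB) mount the ln call fails with an "operation not supported" error, which is the same failure the snapshot service logs further down in this thread.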

@ronaldoo9

@gholmes
containers
error

Hello,
did you solve this problem? I have the same problem.
Thank you

@palbornozl

Hi, same problem here (InfluxDB v1.7.9) on Azure. Any news about it?

Command

influxd backup -portable -database 'myDB' 'myDB_bkp'
2020/04/27 16:52:04 backing up metastore to myDB_bkp/meta.00
2020/04/27 16:52:05 backing up db=myDB
2020/04/27 16:52:06 backing up db=myDB rp=autogen shard=681 to myDB_bkp/myDB.autogen.00681.00 since 0001-01-01T00:00:00Z
2020/04/27 16:52:07 Download shard 681 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (0)...
2020/04/27 16:52:11 Download shard 681 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (1)...

Log

ts=2020-04-27T20:52:13.824530Z lvl=info msg="error creating tsm hard link: \"link /var/lib/influxdb/data/myDB/autogen/681/000000023-000000002.tsm /var/lib/influxdb/data/myDB/autogen/681/14.tmp/000000023-000000002.tsm: operation not supported\"" log_id=0MRIgTX0000 service=snapshot

Thanks!

@ayang64 ayang64 self-assigned this Apr 27, 2020
@ronaldoo9

ronaldoo9 commented May 4, 2020 via email

@palbornozl

Thanks @ronaldoo9 for the reply, but is it necessary to add bind-address = ":8088" if it is already set to 8088 by default?
And where do I need to add it? The [http] section or a new section?

Thanks!

@ronaldoo9

ronaldoo9 commented May 5, 2020 via email

@ronaldoo9

ronaldoo9 commented May 5, 2020 via email

@ronaldoo9

ronaldoo9 commented Jun 2, 2020 via email

@palbornozl

Quoting @ronaldoo9's reply by email: "you need to add it as a new section! just like in the picture"

Sorry for my delay, but which picture are you referring to?
Thanks
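For anyone else hitting the bind-address question: in InfluxDB 1.x the backup/restore RPC listener is configured by the top-level bind-address setting at the very top of influxdb.conf, separate from the [http] section that serves the 8086 API. A minimal excerpt showing the defaults:

bind-address = "127.0.0.1:8088"

[http]
  bind-address = ":8086"

To run influxd backup from a remote host, the top-level setting has to listen on a reachable address (for example 0.0.0.0:8088, as in the original report) and the client passes -host <server>:8088.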

@ronaldoo9

ronaldoo9 commented Jun 3, 2020 via email

@chenzanlee

image
I have encountered the same issue. Can someone help?
image
I am backing up the InfluxDB database from a machine running v1.7.8 to a machine running v1.8.5.

@LeonChadwick

LeonChadwick commented Oct 16, 2020

I have this same problem on InfluxDB 1.8.2.
I am not running under Kubernetes.
I am running the backup command directly on the same host as the database.
Many shard files are successfully backed up before reaching the problem shard, so it's not a port 8088/binding issue: the transfers worked up to the point where it fails, as in the last screenshot above.

err=<nil> isn't a helpful message. No matter what the underlying problem is, the tool should do a better job of explaining it (in addition to fixing whatever the underlying problem is here; clearly it has been around since at least v1.7).

@kozlovsky

It seems to be related to #9923: if there is no free space on the device, InfluxDB stops saving data to disk for some metrics and does not recover even after space is freed. In that state, the backup functionality will not work until the service is restarted. After the restart some data is lost, but backups start working again. At least that was the case for me.

@grath10

grath10 commented Dec 11, 2020

Stuck on the same problem.
I got this message when I looked at the InfluxDB log:
2020-12-11T02:36:39.335325Z info error creating tsm hard link: "link /var/lib/influxdb/data/....../autogen/2/000000001-000000001.tsm /var/lib/influxdb/data/..../autogen/2/1.tmp/000000001-000000001.tsm: operation not permitted" {"log_id": "0Q~vJQ0W000", "service": "snapshot"}
Can anyone suggest how to solve this problem? Thanks!

@mdaguete

Same problem here with version 1.8.2. The problem always appears when trying to back up the last shard, even if it doesn't contain data for the period indicated by start and end.

@zeinsteinz

Same problem here with version 1.8.2. The problem always appears when trying to back up the last shard, even if it doesn't contain data for the period indicated by start and end.

Agreed, same problem with version 1.8.2 on Ubuntu 16.04.

@2012ucp1544

#9923

I am also facing the same issue; restarting InfluxDB helped, but this behaviour is not ideal. Please fix this issue.

@mdaguete

mdaguete commented Feb 1, 2021

After some checks it seems that this is related to the ingestion traffic, as I wrote before. After restarting the influxd daemon the issue persists, but after cloning the machine and starting it without external traffic I was able to complete the backup.

@heinrf

heinrf commented Mar 8, 2021

I had the same problem with version 1.7.7, connected to http://localhost:8086. In my case there was simply no data in the shard. When I started our data logger service on Windows 10 and tried this command again afterwards
>influxd backup -portable -database M-02263C -shard 124 D:\XChange\Influx_TestBackup_M_2263C_Shard24

everything was fine. The backup did what it should.

@lesam
Contributor

lesam commented Sep 7, 2021

Summary:

#16739 (comment) - I think this is the important part.

Download shard 122 failed copy backup to file: err=<nil>, n=0. Waiting 2s and retrying (0)... - this means there was no explicit error but we did not get any data while trying to get a snapshot of the shard to back up.

For some users the error happens when trying to create a hard link to download the snapshot from: ts=2020-04-27T20:52:13.824530Z lvl=info msg="error creating tsm hard link: \"link /var/lib/influxdb/data/myDB/autogen/681/000000023-000000002.tsm /var/lib/influxdb/data/myDB/autogen/681/14.tmp/000000023-000000002.tsm: operation not supported\"" log_id=0MRIgTX0000 service=snapshot.

I did learn that part of the problem is that I was mounting Azure file shares, which don't support hard links

That explains that part. The hard links are created so that the backup process on the server can download the files.

So the resolution should be that we properly return an error to the backup CLI when hard links are not supported, instead of reporting err=<nil>.
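A minimal sketch of that idea, assuming a Go helper along the lines of what the snapshot code could do; this is not the actual InfluxDB implementation, and the paths and function names are illustrative only:

package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"syscall"
)

// linkOrCopy tries to hard-link src into the snapshot directory and, when the
// filesystem reports that hard links are unsupported, falls back to copying
// the file instead of silently producing no data.
func linkOrCopy(src, dst string) error {
	err := os.Link(src, dst)
	if err == nil {
		return nil
	}
	// os.Link wraps the errno in *os.LinkError, so errors.Is can check for
	// ENOTSUP ("operation not supported"). Some filesystems report
	// "operation not permitted" (EPERM) instead, as seen earlier in this
	// thread; that case would need care to avoid masking real permission
	// problems.
	if errors.Is(err, syscall.ENOTSUP) {
		return copyFile(src, dst)
	}
	return fmt.Errorf("error creating tsm hard link: %w", err)
}

// copyFile copies src to dst byte for byte and syncs the result to disk.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()
	if _, err := io.Copy(out, in); err != nil {
		return err
	}
	return out.Sync()
}

func main() {
	// Hypothetical file names, mirroring the shard snapshot layout in the logs.
	if err := linkOrCopy("000000023-000000002.tsm", "000000023-000000002.tsm.snapshot"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}

The fix that eventually landed takes essentially this shape, per the commit messages further down: fall back to copies when os.Link fails with syscall.ENOTSUP, and always copy on Windows.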

@lesam lesam assigned lesam and unassigned dgnorton and ayang64 Sep 9, 2021
@lesam
Contributor

lesam commented Sep 13, 2021

See also #22446 for the 2.x issue.

@davidby-influx
Contributor

davidby-influx commented Sep 22, 2021

@lesam - we could generalize the fix for #16289 to support operating systems without hard links. GOOS or compile-time equivalents won't tell us in all cases whether we need to copy or link. Perhaps a configuration flag, or maybe we can have smart detection of the problem.

@lesam
Contributor

lesam commented Sep 27, 2021

I like smart detection

@lesam lesam assigned davidby-influx and unassigned lesam Oct 4, 2021
@lesam
Contributor

lesam commented Oct 4, 2021

#16739 (comment) is a great idea, thanks for agreeing to make it a reality, @davidby-influx!

lesam added a commit to lesam/influxdb that referenced this issue Oct 13, 2021
lesam added a commit that referenced this issue Oct 14, 2021
@davidby-influx
Contributor

New logging when we detect a filesystem that does not support hard links and switch to making snapshot copies for backups:

2021-10-19T20:44:38.851335Z	info	Snapshot for path written	{"log_id": "0XI__xbl000", "engine": "tsm1", "trace_id": "0XI_ceql000", "op_name": "tsm1_cache_snapshot", "path": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1", "duration": "108.319ms"}
2021-10-19T20:44:38.851464Z	info	Cache snapshot (end)	{"log_id": "0XI__xbl000", "engine": "tsm1", "trace_id": "0XI_ceql000", "op_name": "tsm1_cache_snapshot", "op_event": "end", "op_elapsed": "108.374ms"}
2021-10-19T20:44:38.865768Z	info	linking backup snapshots	{"log_id": "0XI__xbl000", "engine": "tsm1", "service": "filestore", "OldPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/000000001-000000001.tsm", "NewPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/1.tmp/000000001-000000001.tsm"}
2021-10-19T20:44:38.880481Z	info	file system does not support hard links, switching to copies for backup	{"log_id": "0XI__xbl000", "engine": "tsm1", "service": "filestore", "OldPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/000000001-000000001.tsm", "NewPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/1.tmp/000000001-000000001.tsm"}
2021-10-19T20:44:38.917500Z	info	copying backup snapshots	{"log_id": "0XI__xbl000", "engine": "tsm1", "service": "filestore", "OldPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/000000002-000000001.tsm", "NewPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/1.tmp/000000002-000000001.tsm"}
2021-10-19T20:44:38.961133Z	info	copying backup snapshots	{"log_id": "0XI__xbl000", "engine": "tsm1", "service": "filestore", "OldPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/000000003-000000001.tsm", "NewPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/1.tmp/000000003-000000001.tsm"}
2021-10-19T20:44:39.018015Z	info	copying backup snapshots	{"log_id": "0XI__xbl000", "engine": "tsm1", "service": "filestore", "OldPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/000000004-000000001.tsm", "NewPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/1.tmp/000000004-000000001.tsm"}
2021-10-19T20:44:39.056739Z	info	copying backup snapshots	{"log_id": "0XI__xbl000", "engine": "tsm1", "service": "filestore", "OldPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/000000005-000000001.tsm", "NewPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/1.tmp/000000005-000000001.tsm"}
2021-10-19T20:44:39.098784Z	info	copying backup snapshots	{"log_id": "0XI__xbl000", "engine": "tsm1", "service": "filestore", "OldPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/000000007-000000001.tsm", "NewPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/_internal/monitor/1/1.tmp/000000007-000000001.tsm"}
2021-10-19T20:44:39.448028Z	info	Cache snapshot (start)	{"log_id": "0XI__xbl000", "engine": "tsm1", "trace_id": "0XI_chb0000", "op_name": "tsm1_cache_snapshot", "op_event": "start"}
2021-10-19T20:44:39.575645Z	info	Snapshot for path written	{"log_id": "0XI__xbl000", "engine": "tsm1", "trace_id": "0XI_chb0000", "op_name": "tsm1_cache_snapshot", "path": "/azure-files/davidbyazurefiles/influx/.influxdb/data/littletest/autogen/2", "duration": "127.629ms"}
2021-10-19T20:44:39.575689Z	info	Cache snapshot (end)	{"log_id": "0XI__xbl000", "engine": "tsm1", "trace_id": "0XI_chb0000", "op_name": "tsm1_cache_snapshot", "op_event": "end", "op_elapsed": "127.671ms"}
2021-10-19T20:44:39.590567Z	info	linking backup snapshots	{"log_id": "0XI__xbl000", "engine": "tsm1", "service": "filestore", "OldPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/littletest/autogen/2/000000001-000000001.tsm", "NewPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/littletest/autogen/2/1.tmp/000000001-000000001.tsm"}
2021-10-19T20:44:39.605001Z	info	file system does not support hard links, switching to copies for backup	{"log_id": "0XI__xbl000", "engine": "tsm1", "service": "filestore", "OldPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/littletest/autogen/2/000000001-000000001.tsm", "NewPath": "/azure-files/davidbyazurefiles/influx/.influxdb/data/littletest/autogen/2/1.tmp/000000001-000000001.tsm"}

davidby-influx added a commit that referenced this issue Oct 19, 2021
If os.Link fails with syscall.ENOTSUP, then the file
system does not support links, and we must make copies
to snapshot files for backup. We also automatically make
copies instead of link on Windows, because although it
makes links, their semantics are different from Linux.

closes #16739
@davidby-influx davidby-influx linked a pull request Oct 19, 2021 that will close this issue
davidby-influx added a commit that referenced this issue Oct 21, 2021
If os.Link fails with syscall.ENOTSUP, then the file
system does not support links, and we must make copies
to snapshot files for backup. We also automatically make
copies instead of link on Windows, because although it
makes links, their semantics are different from Linux.

closes #16739
davidby-influx added a commit that referenced this issue Oct 21, 2021
If os.Link fails with syscall.ENOTSUP, then the file
system does not support links, and we must make copies
to snapshot files for backup. We also automatically make
copies instead of link on Windows, because although it
makes links, their semantics are different from Linux.

closes #16739
davidby-influx added a commit that referenced this issue Oct 21, 2021
If os.Link fails with syscall.ENOTSUP, then the file
system does not support links, and we must make copies
to snapshot files for backup. We also automatically make
copies instead of link on Windows, because although it
makes links, their semantics are different from Linux.

closes #16739

(cherry picked from commit d9b9e86)

closes #22700
davidby-influx added a commit that referenced this issue Oct 21, 2021
…22730)

If os.Link fails with syscall.ENOTSUP, then the file
system does not support links, and we must make copies
to snapshot files for backup. We also automatically make
copies instead of link on Windows, because although it
makes links, their semantics are different from Linux.

closes #16739

(cherry picked from commit d9b9e86)

closes #22700
davidby-influx added a commit that referenced this issue Oct 21, 2021
If os.Link fails with syscall.ENOTSUP, then the file
system does not support links, and we must make copies
to snapshot files for backup. We also automatically make
copies instead of link on Windows, because although it
makes links, their semantics are different from Linux.

closes #16739

(cherry picked from commit d9b9e86)

closes #22701
davidby-influx added a commit that referenced this issue Oct 22, 2021
…22732)

If os.Link fails with syscall.ENOTSUP, then the file
system does not support links, and we must make copies
to snapshot files for backup. We also automatically make
copies instead of link on Windows, because although it
makes links, their semantics are different from Linux.

closes #16739

(cherry picked from commit d9b9e86)

closes #22701
@wogeo

wogeo commented Feb 18, 2023

I made a preliminary attempt. In my own version, I found that the permissions of the data directory had been changed to root. I reverted the permissions, and it is running now.

@afalldin

I had a similar backup problem in v2.6.1, where after a few shards had been backed up it suddenly failed with:
Error: failed to backup bucket data: failed to download snapshot of shard 21: unexpected EOF

The eye-catcher in the log suggesting the backup hit a limit:
msg="internal error not returned to client" log_id=0g9AhmQG000 handler=error_logger error="write tcp xx.xx.xx.xx:8086->yy.yy.yy.yy:50886: i/o timeout"

In this case it was because the backup hit the http-write-timeout limit; removing this setting from config.toml solved the problem.
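For reference, http-write-timeout is an influxd 2.x configuration option; instead of deleting it, raising it well above the time a large shard download needs should also work (a sketch; the value below is arbitrary):

http-write-timeout = "30m"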

chengshiwen pushed a commit to chengshiwen/influxdb that referenced this issue Aug 27, 2024
…#22703)

If os.Link fails with syscall.ENOTSUP, then the file
system does not support links, and we must make copies
to snapshot files for backup. We also automatically make
copies instead of link on Windows, because although it
makes links, their semantics are different from Linux.

closes influxdata#16739
chengshiwen pushed a commit to chengshiwen/influxdb-cluster that referenced this issue Aug 28, 2024
If os.Link fails with syscall.ENOTSUP, then the file
system does not support links, and we must make copies
to snapshot files for backup. We also automatically make
copies instead of link on Windows, because although it
makes links, their semantics are different from Linux.

closes influxdata/influxdb#16739
zhu733756 pushed a commit to zhu733756/influxdb-cluster that referenced this issue Nov 18, 2024
update readme
