-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fully clean up partially opened TSI. #23427
Comments
lesam
added a commit
to lesam/influxdb
that referenced
this issue
Jun 10, 2022
3 tasks
lesam
added a commit
to lesam/influxdb
that referenced
this issue
Jun 10, 2022
davidby-influx
added a commit
that referenced
this issue
Jun 10, 2022
When one partition in a TSI fails to open, all previously opened partitions should be cleaned up, and remaining partitions should not be opened closes #23427
3 tasks
davidby-influx
added a commit
that referenced
this issue
Jun 10, 2022
When one partition in a TSI fails to open, all previously opened partitions should be cleaned up, and remaining partitions should not be opened closes #23427
This was referenced Jun 10, 2022
davidby-influx
added a commit
that referenced
this issue
Jun 13, 2022
3 tasks
davidby-influx
added a commit
that referenced
this issue
Jun 13, 2022
davidby-influx
added a commit
that referenced
this issue
Jun 14, 2022
3 tasks
davidby-influx
added a commit
that referenced
this issue
Jun 14, 2022
chengshiwen
pushed a commit
to chengshiwen/influxdb
that referenced
this issue
Aug 11, 2024
When one partition in a TSI fails to open, all previously opened partitions should be cleaned up, and remaining partitions should not be opened closes influxdata#23427
chengshiwen
pushed a commit
to chengshiwen/influxdb-cluster
that referenced
this issue
Aug 25, 2024
When one partition in a TSI fails to open, all previously opened partitions should be cleaned up, and remaining partitions should not be opened closes influxdata/influxdb#23427
chengshiwen
pushed a commit
to chengshiwen/influxdb
that referenced
this issue
Aug 27, 2024
When one partition in a TSI fails to open, all previously opened partitions should be cleaned up, and remaining partitions should not be opened closes influxdata#23427
chengshiwen
pushed a commit
to chengshiwen/influxdb-cluster
that referenced
this issue
Aug 28, 2024
When one partition in a TSI fails to open, all previously opened partitions should be cleaned up, and remaining partitions should not be opened closes influxdata/influxdb#23427
zhu733756
pushed a commit
to zhu733756/influxdb-cluster
that referenced
this issue
Nov 18, 2024
update readme fix(influx_inspect): multiple retention policies problem in influx_inspect export (influxdata/influxdb#23197) add github workflows support features and bugfixes release v1.8.10-c1.1.0 upgrade raft to 1.3.11 support allow-out-of-order-writes fix announcement concurrent map iteration and map write in meta handler fix and improve hinted handoff - optimize the load order of segments in hh - optimize marshalWrite and unmarshalWrite in node processor - support max-writes-pending in hh - do not queue partial write errors to hinted handoff - prevent the hinted handoff from becoming blocked if it encounters field type errors - fix issue where read bytes, blocked writes and dropped writes were not recorded in hh - ensure the hinted handoff (hh) queue makes forward progress when segment errors occur - verify and truncate the queue segment files if any are corrupted upon node startup - improve hinted handoff metrics - append bytes to buffer in hh queue to avoid OOM release v1.8.10-c1.1.1 support storage reads merge and stream reader optimize shard not found tips influxd-ctl show-shards optimize nodeID value to tcp addr in statistics adjust hinted handoff logs support flux language query and fix influxql statement in cluster - support flux language query - support show queries and kill query statement - fix explain and explain analyze statement - fix show measurement cardinality and show series cardinality statement release v1.8.10-c1.1.2 fix bug: null pointer from hh caused panic chengshiwen#3 update architecture link in readme fix nil pointer dereference in marshal binary of join cluster response chengshiwen#15 avoid panic in processing write shard request and correct the comments in rpc.go fix possible deadlock in client_pool fix possible deadlock in node_processor chengshiwen#42 fix data race in meta client optimize meta tests optimize test, build, docker and workflows update docker image fix influxdb-builder dockerfile remove git and jenkins dockerfile fix dockerfile build fix test.sh chore: upgrade to go 1.21.13 fix syntax sugar under 1.21 replace +build with go:build update .gitignore chore: increase timer to 5 seconds (#22664) fix: detect misquoted tag values and return an error (#22754) (#22787) SHOW TAG KEYS FROM "foo" where bar="misquoted" is erroneous, because the tag value must be enclosed in single, not double, quotes. Although this correctly returns no tag keys, it is very inefficient and has cause out-of-memory failures at a customer. This fix short-circuits the query. closes influxdata/influxdb#22755 (cherry picked from commit af9e89a4d46b7f83623b3cdd5996fcccc39609e8) closes influxdata/influxdb#22786 fix(tsm1): Fix temp directory search bug (#17685) * fix: verify precision parameter in write requests This change updates the HTTP endpoints that service v1 and v2 writes to verify the values passed in the precision parameter. * fix(tsm1): Fix temp directory search bug The original code's intention is to scan a directory for the directory with the higest value when converted to an integer. So directories may be in the form: 0.tmp 1.tmp 2.tmp 30.tmp ... 100.tmp The loop should scan the directory, strip the basename and extension from the file name to leave just a number, then store the higest number it finds. Before this patch, there is a bug that has the code only store the higest value if there is an error converting the numeric value into an integer. This patch primarily fixes that logic. In addition, this patch will save an indent level by inverting logic in two places: Instead if checkig if a file is a directory and has a suffix of ".tmp", it is probably better to test if a file is NOT a directory OR does NOT have an extension of ".tmp" then continue. Also, instead of testig if len(ss) == 2, we can test if len(ss) != 2 and continue if so. Both of these save an indent level and keeps our "happy path" to the left. Finally, this patch will use string concatination instead of calling fmt.Sprintf() to add periods to "tmp" and "tsm" extension. Co-authored-by: David Norton <dgnorton@gmail.com> fix(httpd): Fixes key collisions when serializing /debug/vars fix(tsdb): Fix variables masked by a declaration (#18129) Before this commit, the to and from variables were being re-declared in a block in such a way that the values were not being used. This patch uses regular assignment so that the values are visable outside of the block where they're set. Closes: 18128 refactor(query): reuse matchAllRegex (#18146) matchAllRegex is a global variable containing the precompiled regex that matches ".+". Prior to this commit, it was used in only one place and we called its .Copy() method. According to the docs, .Copy() is no longer needed for safe concurrent access: Deprecated: In earlier releases, when using a Regexp in multiple goroutines, giving each goroutine its own copy helped to avoid lock contention. As of Go 1.12, using Copy is no longer necessary to avoid lock contention. Copy may still be appropriate if the reason for its use is to make two copies with different Longest settings. Since we require Go 1.13 or later now and we're not calling the Longest() method, this patch removes the .Copy() call. Now that we have a reusable matchAllRegex value, this patch then replaces all instances of regexp.MustCompile(`.+`) with matchAllRegex. This will elminate runtime regex compilations. fix(httpd): add option to authenticate prometheus remote read (#18429) fix(CORS): allow PATCH refactor: Use filepath.Walk (#19514) Prior to this commit, we had our own recursive file walker which required a condition based on if s.Config.TypesDB pointed to a directory or a regular file. This commit replaces our own readdir() with filepath.Walk() and treats recursing directories and loading one file as a single case. This simplifies the code quite a bit. fix: minor test fixes for go1.15 and also flaky timeouts Also run gofmt fix: consistent error for missing shard (#20694) fix(tsm1): fix data race and validation in cache ring (#20802) Co-authored-by: Yun Zhao <zhaoyun2316@gmail.com> Co-authored-by: Yun Zhao <zhaoyun2316@gmail.com> fix: do not send non-UTF-8 characters to subscriptions (#21558) Added a check for valid UTF-8 strings in measurement names, tags name, tag values, and field names when writing to subscriptions. Do not send the failing points to subscribers, and log the errors if at debug level logging Closes influxdata/influxdb#21557 fix: avoid rewriting fields.idx unnecessarily (#21592) Under heavy write load creating new fields and measurements the rewrite of the fields.idx file is a bottleneck. This enhancement combines multiple writes into a single one and shares any error return value with all of the combined invocations. MeasurementFieldSet and the new MeasurementFieldSetWriter must both now be explicitly closed. Closes #21577 fix: Do not close connection twice in DigestWithOptions (#21659) tsm1.DigestWithOptions closes its network connection twice. This may cause broken pipe errors on concurrent invocations of the same procedure, by closing a reused i/o descriptor. This fix also captures errors from TSM file closures, which were previously ignored. Closes influxdata/influxdb#21656 fix: do not panic on cleaning up failed iterators (#21666) We have seen occasional panics in Iterators.Close() when cleaning up after failed iterator creation. This commit checks for nil on any iterator to be closed, and now returns any errors generated by that Close(). Closes influxdata/influxdb#19579 Closes influxdata/influxdb#19476 fix: error instead of panic for statement rewrite failure (#21792) fix: show shards gives empty expiry time for inf duration shards (#21795) fix: a few suddenly flaky tests involving randomness (#21818) Closes #21817 fix: copy names from mmapped memory before closing iterator (#22040) This fix ensures that memory-mapped files are not released before pointers into them are copied into heap memory. MeasurementNamesByExpr() and MeasurementNamesByPredicate() can cause panics by copying memory from mmapped files that have been released. The functions they call use iterators to files which are closed (releasing the mmapped files) before the memory is safely copied to the heap. closes influxdata/influxdb#22000 test: fix order of index teardown (#22038) fix: return correct count of ErrNotExecuted (#22273) executeQuery() iterates over statements until each is processed or if an error is encountered that causes the loop to exit pre-maturely. It should return ErrNotExecuted for each remaining statement in the query closes influxdata/influxdb#19136 fix: correct error return shadowing (#22353) fix: flux error properly read by cloud (#22348) fix: For Windows, close temp file before removing (#22492) closes influxdata/influxdb#21470 fix(tsdb): sync series segment to disk after writing (#22566) fix: create shards without overlaps (#22601) chore: fix deadlock in `influx_inspect dumptsi` (#22661) fix: extend snapshot copy to filesystems that cannot link (#22703) If os.Link fails with syscall.ENOTSUP, then the file system does not support links, and we must make copies to snapshot files for backup. We also automatically make copies instead of link on Windows, because although it makes links, their semantics are different from Linux. closes influxdata/influxdb#16739 fix(restore): fix race condition which causes restore command to fail (#22796) * fix(restore): fix race condition which causes restore command to fail Fixes a race condition in the restore code path that causes shard data restores to fail. When the bug occurs, `Error while freeing cold shard resources` appears in the log files. fixes issue #15323 fix(tsi): sync index file before close (#21932) fix: `influxd-ctl backup` will create a working backup when only `-shard` given (#22998) `influxd-ctl backup` will now create a working backup when only the `-shard` option is given. Previously this would create a backup that could not be restored. fixes #16751 fix: return underlying error creating a subscription (#23217) When creating a subscription, return the wrapped error on failure closes influxdata/influxdb#23216 build: upgrade to go1.18 (#23250) fix(security): bump several dependencies to fix security issues chore: fix typo in config.sample.toml (#21125) fix(httpd): abort processing write request when encountering a precision error (#21746) fix: MeasurementsCardinality should not be less than 0 (#23286) Clamp the value of Store.MeasurementsCardinality so that it can not be less than 0. This primarily shows up as a negative `numMeasurements` value in /debug/vars under some circumstances. refs #23285 fix: remove data directory appending for influx_inspect verify (#23336) influx_inspect verify -dir will no longer append the "/data" path to the dir. Files are checked recursively, so this will still include files in the "/data" path as well as other subdirectories. closes influxdata/influxdb#22572 fix: replace unprintable and invalid characters in errors (#23387) Replace unprintable and invalid characters with '?' in logged errors. Truncate consecutive runs of them to only 3 repeats of '?' closes influxdata/influxdb#23386 refactor: Use binary.Read() instead of io.ReadFull() (#19323) The original version of verifyVersion() reads into a byte slice, manually ensures its byte order, then converts it to a type comparable with Version and MagicNumber. This patch hides those details by calling binary.Read() and reading values into properly typed variables. This adds a bit of overhead but this code isn't in the hot-path and this patch greatly simplifies the code. verifyVersion() originally accepted an io.ReadSeeker. It is only called in once place and that function immediately calls seek after verifyVersion(), therefore it is probably safe to call Seek() BEFORE verifyVersion(). The benefit is that verifyVersion() is easier to test since we can pass it a bytes.Buffer. This patch adds a test for verifyVersion() as well as a benchmark. benchmark old ns/op new ns/op delta BenchmarkVerifyVersion-8 73.5 123 +67.35% Finally, this commit moves verifyVersion() from writer.go to reader.go which is where it is actually used. fix: do not rename files on mmap failure (#23396) If NewTSMReader() fails because mmap fails, do not rename the file, because the error is probably caused by vm.max_map_count being too low closes influxdata/influxdb#23172 fix: fully clean up partially opened TSI (#23430) When one partition in a TSI fails to open, all previously opened partitions should be cleaned up, and remaining partitions should not be opened closes influxdata/influxdb#23427 fix: remember shards that fail Open(), avoid repeated attempts (#23437) If a shard cannot be opened, store its ID and last error. Prevent future attempts to open during this invocation of influxDB. This information is not persisted. closes influxdata/influxdb#23428 closes influxdata/influxdb#23426 fix: lost TSI reference / close TagValueSeriesIDIterator in error case (#23461) (#23462) (cherry picked from commit 8bd4fc502d12a0e2ece10eb86e64832646640cda) closes influxdata/influxdb#23460 Co-authored-by: Dane Strandboge <dstrandboge@influxdata.com> fix: eliminate race condition on Monitor.globalTags (#23467) fixes #23466 fix: improve error messages opening index partitions (#23532) Where possible, add the file path path to any errors on opening, reading, (un)marshaling, or validating the various files comprising a partition closes influxdata/influxdb#23506 fix: create TSI MANIFEST files atomically (#23539) When a MANIFEST file is created in TSI, it should be written to a temp file, then atomically renamed, to avoid overwriting the existing file only to fail on the later write. closes influxdata/influxdb#23536 fix: add paths to tsi log and index file errors (#23557) Add paths to various TSI errors on opening and unmarshaling files to help poinpoint the corrupt files. Closes influxdata/influxdb#23556 fix: add reporttsi to the help text (#23566) reporttsi was not listed as a command in the influx_inspect help text. fix: generalize test for Windows (#23580) Also eliminate race condition in tests (cherry picked from commit 7e37a7ad1610771409d9b651a775f3a4ab4352ed) fix: use copy when a rename spans volumes (#23785) When a file rename fails with EXDEV (cross device or volume error), copy the file and delete the original instead Differs from master branch by overwriting existing files instead of erring. closes influxdata/influxdb#22997 fix: add tests for file rename across volumes (#23787) Also move shared code from file_unix.go fix: log errors in continuous query statistics storage (#23822) fix: don't write skipped shard messages to the line protocol output destination (#23727) (#23885) This switches so that the message skipped missing file: /path/to/tsm.tsm is written to stdErr instead of stdout (or the output file if `-out` has been provided) (cherry picked from commit a9bf1d54c1eb77c9b5eafef9d1d3db7511ded9c8) closes influxdata/influxdb#23866 Co-authored-by: Ben Tasker <88340935+btasker@users.noreply.github.com> chore: upgrade Go to v1.19.3 (1.x) (#23941) * chore: upgrade Go to 1.19.3 This re-runs ./generate.sh and ./checkfmt.sh to format and update source code (this is primarily responsible for the huge diff.) * fix: update tests to reflect sorting algorithm change fix: series file index compaction (#23916) Series file indices monotonically grew even when series were deleted. Also stop ignoring error in series index recovery Partially closes https://github.com/influxdata/EAR/issues/3643 fix: do not escape CSV output (#24311) CSV output is incorrectly escaped. Add a boolean flag to tag output functions to prevent this. closes influxdata/influxdb#24309 fix: avoid SIGBUS when reading non-std series segment files (#24509) Some series files which are smaller than the standard sizes cause SIGBUS in influx_inspect and influxd, because entry iteration walks onto mapped memory not backed by the the file. Avoid walking off the end of the file while iterating series entries in oddly sized files. closes influxdata/influxdb#24508 Co-authored-by: Geoffrey Wossum <gwossum@influxdata.com> fix: panic index out of range for invalid series keys (#24565) * chore: add scaffolding for naive solution * feat: test case scaffolding * fix: implement check for series key before proceeding * fix: add validation for ReadSeriesKeyMeasurement usage * refactor: explicit use of series key len * feat: add remaining check to index * feat: add check to remaining files As the Len function is used as part of the parseSeriesKey, this also needs to be accounted for on the nil return from this function as it is used in different contexts * feat: expand test cases * chore: go fmt * chore: update test failure message * chore: impl feedback on unnecessary sz checks * feat: expand test cases * fix: nil series key check In both sections for index.go there is a pre-existing length check against the series key which should catch invalid values, perhaps this explains why it hasn't cropped up in the reported panics. For even more safety, we can also skip a nil key because we know that subsequent calls will cause a panic where this key is attempted to be used * fix: remove nil tags check A key with no tags is valid, so we should not check for BOTH nil key and tags as a key could be nil, which is invalid, yet still have tags and therefore cause the check to pass which we do not want * feat: extend test cases from feedback * fix: extend checks for CompareSeriesKeys * feat: add nilKeyHandler for shared key checking logic * fix: logical error in nilKeyHandler Prior to this, the else was always defaulted to at the end of the conditional branch, which causes unexpected behaviour and a failure of a bunch of tests. * fix: return tags keep nil data In a recent change to this, we agreed on a simple name == nil check for the actual data. As a follow on to this, I just realised that we don't actually want to nil back the tags, even if they're not checked, because having no tags is a valid input so we can simply return whatever we were passed unchanged. * fix: use len == 0 for extra safety * feat: extra test for blank series key chore: upgrade to influxdata/influxql v1.2.0 (#24764) Upgrade to influxdata/influxql v1.2.0. While it does not fix any known issues in InfluxDB OSS 1.x, it is necessary for upstream projects impacted by influxdata/influxql#65. In addition to upgrading influxdata/influxql, this also updates test cases that relied on the erroneous precision handling when stringifying InfluxQL ASTs. Visible impacts to InfluxDB OSS 1.x: - Changes precision of floating point numbers in error messages related to InfluxQL - Changes precision of floating point numbers in "EXPLAIN" and "EXPLAIN ANALYZE" output - Changes precision of floating point numbers from InfluxQL expressions included in tracing spans closes: #24763 fix: do not panic when empty tags are queried (#24784) Do not panic if a cursor array is nil and the number of timestamps is retrieved. closes influxdata/influxdb#24536 fix: improved shard deletion (#24602) Avoid unnecessarily deleting series from the series file Try harder to delete series from InMem indices Log all errors on shard deletion Closes influxdata/influxdb#24834 fix: ensure TSMBatchKeyIterator and FileStore close all TSMReaders (#24957) Do not let errors on closing a TSMReader prevent other closes. fix: return MergeIterator.Close errors (#24975) Ensure that errors from closing the iterators underneath a MergeIterator are returned up the stack. fix: ignore empty index error deleting last measurement (#25037) An empty index is appropriate when deleting the last measurement. Also clean up error handling, avoid duplicate calls to Close. closes influxdata/influxdb#9929 fix: GROUP BY queries with offset that crosses a DST boundary fail. (#25082) This is actually the second fix for influxdata/influxdb#20238 for when the time zone falls back in autumn. closes influxdata/influxdb#25078 fix: Store.validateArgs wrongfully overwriting start, end unix time (#25146) (#25165) When querying data before 1970-01-01 (UNIX time 0) validateArgs would set start to -in64 max and end to int64 max. closes influxdata/influxdb#24669 Co-authored-by: Paul Hegenberg <paul.hegenberg@gmail.com> closes influxdata/influxdb#25149 fix(tsi1): fix data race between appendEntry and FlushAndSync tsi1.(*LogFile) (#25182) Extend lock lifespan to encompass the flushAndSync() call to avoid a race closes influxdata/influxdb#25181 fix: require database authorization to see continuous queries (#22283) SHOW CONTINUOUS QUERIES now requires the same permissions as SHOW DATABASES in order to see the continuous queries in a database. closes influxdata/influxdb#10292 feat: Add WITH KEY to show tag keys (#20793) * fix: Change from RewriteExpr to PartitionExpr Also remove some dead code * feat: WITH KEY implementation * feat: query rewriting for WITH KEY in SHOW TAG KEYS feat: SHOW TAG VALUES should produce results from one specific RP (#21983) Ensure that the Sources field of the ShowTagValuesStatement is filled in. Then use the sources to limit the retention policies, and thus the shards from which tag values are collected. This fix only works on TSI databases; INMEM shards share indices, so restricting shard indices used does not restrict the tag values returned. This will not permit multiple retention policies to be specified in a query; either all RPs or one are permitted. Closes influxdata/influxdb#21981 feat: show measurements database and retention policy wildcards (#22388) * feat: show measurements database and retention policy wildcards Closes #3318 feat: add thread-safe access to CountingWriter byte total (#22620) Use atomic operations to update and report CountingWriter.Total through a new method. closes influxdata/influxdb#22618 feat: optionally dump queries to log on SIGTERM (#22638) Dump all active queries to the log when a SIGTERM is received and the termination-query-log flag is true in the coordinator section of the config. The default is false. feat: configurable DELETE concurrency (#23055) Currently, deletion of series or measurements are serialized. This new feature will add max-concurrent-deletes to the [data] section of the configuration file. Legal values are any positive number, defaulting to 1, the current behavior. closes influxdata/influxdb#23054 feat: log slow queries even without query logging (#23320) Log long-running queries if "log-queries-after" > 0, even if general query logging is not enabled. closes influxdata/influxdb#23147 feat: log the log level regardless of log level (#23425) feat: add version number to debug/vars (#23795) closes influxdata/influxdb#23793 feat: add the ability to log queries killed by `query-timeout` (#23978) * feat: add the ability to log queries killed by `query-timeout` * chore: update example config * chore: improve logging details chore: update changelog for 1.8 chore: add checkfmt.sh in workflows chore: fix indent and typo in config.sample.toml and meta.config.sample.toml fix race condition in meta client rename ShowShards to ListShards use rand.New(rand.NewSource) instead of deprecated rand.Seed gossip /announce will try to request other nodes when network failed remove any expired shards and give empty expiry time for inf duration shards from the /show-shards output fix read tcp i/o timeout in copy-shard chengshiwen#39 optimize connection pool and test * fixes data-race between Get() and Close() of channelPool * using sync.RWMutex instead of sync.Mutex in boundedPool * refactor idle timeout in pool * add connection pool test * replace fatih/pool.v2 with custom pool optimize meta executor and shard writer fix influxd close and reset issue fix(coordinator): fix closing opened twice in points writer optimize logger output under write timeout failed fix: fix typo 'exceeed' with 'exceeded' optimize total channel to avoid blocking in pool optimize build and test fix(tsm1): Fix data race of seriesKeys in deleteSeriesRange fix(config): fix max-concurrent-deletes typo in config.sample.toml release v1.8.11-c1.2.0
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
When one partition in a TSI fails to open, all previously opened partitions should be cleaned up, and remaining partitions should not be opened (or should be opened and then cleaned up)
The text was updated successfully, but these errors were encountered: