-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to ignore errors in CSV parsing + improve error messages #4067
Conversation
src/processor/operator/persistent/reader/csv/base_csv_reader.cpp
Outdated
Show resolved
Hide resolved
src/processor/operator/persistent/reader/csv/base_csv_reader.cpp
Outdated
Show resolved
Hide resolved
Benchmark ResultMaster commit hash:
|
3f3eadc
to
ac7eeaf
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #4067 +/- ##
==========================================
+ Coverage 83.99% 84.12% +0.13%
==========================================
Files 1307 1312 +5
Lines 51431 51703 +272
Branches 7143 7173 +30
==========================================
+ Hits 43198 43496 +298
+ Misses 8084 8058 -26
Partials 149 149 ☔ View full report in Codecov by Sentry. |
Benchmark ResultMaster commit hash:
|
@@ -205,8 +207,7 @@ bool BaseCSVReader::readBuffer(uint64_t* start) { | |||
auto readCount = fileInfo->readFile(buffer.get() + remaining, bufferReadSize); | |||
if (readCount == -1) { | |||
// LCOV_EXCL_START | |||
throw CopyException( | |||
stringFormat("Could not read from file {}: {}", fileInfo->path, posixErrMessage())); | |||
handleCopyException(stringFormat("Could not read from file: {}", posixErrMessage()), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming for these filesystem-related errors we should throw regardless of whether have have IGNORE_ERRORS
set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. @acquamarin and @mxwli ?
src/include/processor/operator/persistent/reader/csv/parallel_csv_reader.h
Outdated
Show resolved
Hide resolved
throw CopyException(stringFormat("Error in file {} on line {}: {}", reader->fileInfo->path, | ||
reader->getLineNumber(), e.what())); | ||
reader->handleCopyException(e.what()); | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handling conversion exceptions using try/catch should have a pretty high performance penalty, keeping it this way in this PR for simplicity though (and it is still manageable in release mode). Just keep in mind that if there are a large number of conversion exceptions and we have backtraces enabled (e.g. in the tests) this will be unbearably slow
082f89a
to
d09d4b6
Compare
Benchmark ResultMaster commit hash:
|
Benchmark ResultMaster commit hash:
|
src/include/processor/operator/persistent/reader/csv/base_csv_reader.h
Outdated
Show resolved
Hide resolved
ef3e3ac
to
34de000
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope to see some benchmark numbers on COPY (and LOAD FROM, perhaps LOAD FROM makes more since you're only making changes to csv reader for now):
- to double check that we don't have performance regression when there are no errors.
- when error limit is small (default value), there isn't much performance degradation for LOAD FROM.
Let's see some numbers for sanity checks, and then it should be good to go to me.
src/include/processor/operator/persistent/reader/csv/csv_error.h
Outdated
Show resolved
Hide resolved
src/include/processor/operator/persistent/reader/csv/csv_error.h
Outdated
Show resolved
Hide resolved
src/include/processor/operator/persistent/reader/csv/parallel_csv_reader.h
Outdated
Show resolved
Hide resolved
src/include/processor/operator/persistent/reader/csv/csv_error.h
Outdated
Show resolved
Hide resolved
src/include/processor/operator/persistent/reader/csv/csv_error.h
Outdated
Show resolved
Hide resolved
Some quick sanity benchmarks. All were run with the parallel CSV reader using Load Timemaster: 807.13ms |
Benchmark ResultMaster commit hash:
|
…handler + move file path from error to error handler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
src/include/processor/operator/persistent/reader/csv/csv_error_handler.h
Outdated
Show resolved
Hide resolved
if (hasBeenFinalized) { | ||
return; | ||
} | ||
hasBeenFinalized = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The addition of the recursive calls to finalize can cause finalizeInternal() can currently be called multiple times (across pipelines). This hack prevents any issues that this can cause, but we should properly handle it by ensuring finalize() is called at most once per operator
Benchmark ResultMaster commit hash:
|
Benchmark ResultMaster commit hash:
|
Benchmark ResultMaster commit hash:
|
…uzudb#4067) * Copy exception handling concept Change handling of copy exceptions to go through CSVErrorHandler Add ignore errors csv reader option Update error messages + account for header in line number calculation Update expected error messages in tests Print ... if line was not completely parsed when error occurred Restart state machine if continuing after a parsing error Set end of line for EOF + more test updates Add more tests Report warning in query result Fix errors after rebase * Populate warning messages for rel batch insert * Avoid finalizing in aggregate scan operator * Add expected warning messages * Move warnings to factorized table + print in shell * Limit total number of cached warnings * Update tests with warnings * Deal with case where warning table is null * Avoid stack overflow caused by recursively calling parseCSV + add tests * Fix failing tests caused by trimming whitespace at end of test checking' * Add warnings to C API * Add tests for query result warnings * Code cleanup * Update shell printings * Improve test coverage * Add shell tests to codecov * Fix test failures * Fix windows test failures * Address review comments * Run clang-format * Avoid finalizing children for result collector * Revert changes to factorized table util as it is no longer needed * Revert shell test changes * Remove warnings from query result (to add to CALL() function) * Add function call show_warnings() to show warnings * Populate query id in warning table * Address review comments * Address review comments 2 * Move warning limit to DB config * Bump extension version * Address review comments again * Fix thread safety issue from progress func getting file offset * More review comment addressing * Address review comments * Fix more error messages * ResultCollector finalizes its children * More review comment addressing * Revert change to test_group.h * Optimizations - cache errors locally before flushing to shared error handler + move file path from error to error handler * Fix msvc compile error * Move some shared fields into common csv error handler * Hack around double-finalize issue * Bump extension version --------- Co-authored-by: CI Bot <royi-luo@users.noreply.github.com> (cherry picked from commit b841014)
Description
Partially addresses #1872, adding the following features:
IGNORE_ERRORS
option to CSV parsing that skips any invalid rows during parsing and later reports all invalid rows (like warnings)WARNING_LIMIT
added to limit the number of global warnings stored (for now in each operator, shared between threads that contain the same operator)Additional changes added to accommodate the feature:
ExecutionContext
to store warnings ifIGNORE_ERRORS
is enabledCALL show_warnings()
that displays all warnings encountered in the current connection (up to the configured limit). Output looks something like:Additionally, in the query result for copy an additional row is added if there are warnings that states the number of warnings encountered during that query. If warnings exceeded the limit, a lower bound on the number of warnings is given.
Contributor agreement