Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to ignore errors in CSV parsing + improve error messages #4067

Merged
merged 47 commits into from
Aug 22, 2024

Conversation

royi-luo
Copy link
Collaborator

@royi-luo royi-luo commented Aug 13, 2024

Description

Partially addresses #1872, adding the following features:

  • Add IGNORE_ERRORS option to CSV parsing that skips any invalid rows during parsing and later reports all invalid rows (like warnings)
    • Additional option WARNING_LIMIT added to limit the number of global warnings stored (for now in each operator, shared between threads that contain the same operator)
  • Improve error messages in the CSV reader, adding reconstructed lines + rework the method to calculate erroneous line numbers

Additional changes added to accommodate the feature:

  • Add warnings table to ExecutionContext to store warnings if IGNORE_ERRORS is enabled
  • Add statement CALL show_warnings() that displays all warnings encountered in the current connection (up to the configured limit). Output looks something like:
> CALL show_warnings() RETURN *;
┌──────────┬──────────────────────────────────────────────────────────────────────────────────────────┬────────────────────────────────────┬─────────────┬─────────────────────────┐
│ query_id │ message                                                                                  │ file_path                          │ line_number │ reconstructed_line      │
│ UINT64   │ STRING                                                                                   │ STRING                             │ UINT64      │ STRING                  │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────┼─────────────┼─────────────────────────┤
│ 1        │ Conversion exception: Cast failed. Could not convert "039472389abc23784928347" to INT32. │ /home/r34luo/test/csv/dataset.csv  │ 1717633     │ 039472389abc23784928347 │
│ 1        │ Conversion exception: Cast failed. Could not convert "abc" to INT32.                     │ /home/r34luo/test/csv/dataset1.csv │ 329248      │ abc                     │
└──────────┴──────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────┴─────────────┴─────────────────────────┘
(2 tuples)
(5 columns)
Time: 0.84ms (compiling), 1.11ms (executing)

Additionally, in the query result for copy an additional row is added if there are warnings that states the number of warnings encountered during that query. If warnings exceeded the limit, a lower bound on the number of warnings is given.

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ result                                                                                                   │
│ STRING                                                                                                   │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 75 tuples have been copied to the someTable table.                                                       │
│ 8192+ warnings encountered during copy. Use 'CALL show_warnings() RETURN *' to view the actual warnings. │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Contributor agreement

@royi-luo royi-luo requested a review from mxwli August 13, 2024 15:46
@royi-luo royi-luo self-assigned this Aug 13, 2024
Copy link

Benchmark Result

Master commit hash: 95d338e22f58405ebc255286f8f6594c01e9f0b0
Branch commit hash: cb62aae961ba59347046586100d94a2461b15895

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 685.36 675.04 10.32 (1.53%)
aggregation q28 11682.77 11363.26 319.51 (2.81%)
filter q14 158.97 150.84 8.13 (5.39%)
filter q15 162.03 152.91 9.12 (5.97%)
filter q16 338.60 328.89 9.71 (2.95%)
filter q17 479.38 471.30 8.08 (1.71%)
filter q18 1953.40 1941.45 11.95 (0.62%)
fixed_size_expr_evaluator q07 577.33 577.38 -0.05 (-0.01%)
fixed_size_expr_evaluator q08 788.86 779.19 9.67 (1.24%)
fixed_size_expr_evaluator q09 789.09 782.45 6.63 (0.85%)
fixed_size_expr_evaluator q10 276.75 267.65 9.10 (3.40%)
fixed_size_expr_evaluator q11 270.56 262.35 8.22 (3.13%)
fixed_size_expr_evaluator q12 270.40 261.64 8.76 (3.35%)
fixed_size_expr_evaluator q13 1511.40 1497.71 13.69 (0.91%)
fixed_size_seq_scan q23 154.88 146.35 8.53 (5.83%)
join q31 12.44 12.28 0.15 (1.25%)
ldbc_snb_ic q35 774.75 1087.33 -312.58 (-28.75%)
ldbc_snb_ic q36 50.40 41.66 8.74 (20.98%)
ldbc_snb_is q32 9.10 8.51 0.59 (6.91%)
ldbc_snb_is q33 19.35 18.52 0.84 (4.52%)
ldbc_snb_is q34 7.88 8.90 -1.02 (-11.49%)
multi-rel multi-rel-large-scan 3140.82 2849.47 291.35 (10.22%)
multi-rel multi-rel-lookup 57.50 60.15 -2.65 (-4.41%)
multi-rel multi-rel-small-scan 52.23 63.32 -11.08 (-17.50%)
order_by q25 166.45 153.82 12.63 (8.21%)
order_by q26 480.50 475.39 5.11 (1.08%)
order_by q27 1431.09 1434.29 -3.20 (-0.22%)
scan_after_filter q01 207.04 198.20 8.84 (4.46%)
scan_after_filter q02 193.20 186.16 7.04 (3.78%)
shortest_path_ldbc100 q39 152.96 155.86 -2.91 (-1.87%)
var_size_expr_evaluator q03 2082.66 2086.81 -4.15 (-0.20%)
var_size_expr_evaluator q04 2275.82 2263.25 12.57 (0.56%)
var_size_expr_evaluator q05 2628.27 2640.74 -12.47 (-0.47%)
var_size_expr_evaluator q06 1348.85 1346.79 2.06 (0.15%)
var_size_seq_scan q19 1491.96 1490.11 1.85 (0.12%)
var_size_seq_scan q20 3230.58 3153.42 77.17 (2.45%)
var_size_seq_scan q21 2468.93 2455.25 13.68 (0.56%)
var_size_seq_scan q22 134.94 134.86 0.08 (0.06%)

@royi-luo royi-luo force-pushed the royi/copy-exception-handling branch 5 times, most recently from 3f3eadc to ac7eeaf Compare August 15, 2024 22:33
Copy link

codecov bot commented Aug 16, 2024

Codecov Report

Attention: Patch coverage is 94.24307% with 27 lines in your changes missing coverage. Please review.

Project coverage is 84.12%. Comparing base (ab89ebd) to head (7ad600c).
Report is 4 commits behind head on master.

Files Patch % Lines
...erator/persistent/reader/csv/csv_error_handler.cpp 84.03% 19 Missing ⚠️
...rocessor/operator/persistent/reader/csv/driver.cpp 63.63% 4 Missing ⚠️
src/include/main/settings.h 60.00% 2 Missing ⚠️
...operator/persistent/reader/csv/base_csv_reader.cpp 98.62% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4067      +/-   ##
==========================================
+ Coverage   83.99%   84.12%   +0.13%     
==========================================
  Files        1307     1312       +5     
  Lines       51431    51703     +272     
  Branches     7143     7173      +30     
==========================================
+ Hits        43198    43496     +298     
+ Misses       8084     8058      -26     
  Partials      149      149              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

Benchmark Result

Master commit hash: 9a97f751f564af17cf0a5d53b9155364f8efc587
Branch commit hash: 6bc5c9bc9e4cce09ecc91bd8b7c7a06f34b514e2

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 673.94 676.76 -2.82 (-0.42%)
aggregation q28 11355.40 11527.85 -172.46 (-1.50%)
filter q14 142.28 150.65 -8.37 (-5.56%)
filter q15 146.57 152.08 -5.51 (-3.62%)
filter q16 324.45 324.59 -0.14 (-0.04%)
filter q17 467.71 471.50 -3.78 (-0.80%)
filter q18 1936.99 1969.06 -32.07 (-1.63%)
fixed_size_expr_evaluator q07 559.52 565.76 -6.23 (-1.10%)
fixed_size_expr_evaluator q08 770.58 775.72 -5.14 (-0.66%)
fixed_size_expr_evaluator q09 772.84 772.79 0.05 (0.01%)
fixed_size_expr_evaluator q10 257.46 264.27 -6.81 (-2.58%)
fixed_size_expr_evaluator q11 254.53 262.37 -7.84 (-2.99%)
fixed_size_expr_evaluator q12 252.75 258.56 -5.81 (-2.25%)
fixed_size_expr_evaluator q13 1489.88 1488.27 1.61 (0.11%)
fixed_size_seq_scan q23 135.61 148.50 -12.88 (-8.67%)
join q31 12.16 11.91 0.25 (2.11%)
ldbc_snb_ic q35 771.74 837.67 -65.93 (-7.87%)
ldbc_snb_ic q36 50.54 30.43 20.11 (66.07%)
ldbc_snb_is q32 8.72 10.30 -1.58 (-15.37%)
ldbc_snb_is q33 16.80 16.19 0.62 (3.80%)
ldbc_snb_is q34 8.24 8.50 -0.25 (-2.98%)
multi-rel multi-rel-large-scan 2846.36 2754.35 92.01 (3.34%)
multi-rel multi-rel-lookup 65.36 53.73 11.64 (21.66%)
multi-rel multi-rel-small-scan 58.09 48.56 9.53 (19.63%)
order_by q25 151.36 156.59 -5.23 (-3.34%)
order_by q26 473.09 478.09 -5.00 (-1.05%)
order_by q27 1399.86 1435.62 -35.76 (-2.49%)
scan_after_filter q01 192.14 197.75 -5.62 (-2.84%)
scan_after_filter q02 178.25 187.73 -9.48 (-5.05%)
shortest_path_ldbc100 q39 52.62 52.93 -0.31 (-0.59%)
var_size_expr_evaluator q03 2064.06 2091.88 -27.82 (-1.33%)
var_size_expr_evaluator q04 2248.57 2297.03 -48.46 (-2.11%)
var_size_expr_evaluator q05 2549.78 2702.27 -152.48 (-5.64%)
var_size_expr_evaluator q06 1334.54 1360.33 -25.79 (-1.90%)
var_size_seq_scan q19 1477.72 1484.50 -6.78 (-0.46%)
var_size_seq_scan q20 3165.08 3158.79 6.30 (0.20%)
var_size_seq_scan q21 2393.72 2405.72 -12.00 (-0.50%)
var_size_seq_scan q22 128.65 133.80 -5.15 (-3.85%)

@@ -205,8 +207,7 @@ bool BaseCSVReader::readBuffer(uint64_t* start) {
auto readCount = fileInfo->readFile(buffer.get() + remaining, bufferReadSize);
if (readCount == -1) {
// LCOV_EXCL_START
throw CopyException(
stringFormat("Could not read from file {}: {}", fileInfo->path, posixErrMessage()));
handleCopyException(stringFormat("Could not read from file: {}", posixErrMessage()), true);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming for these filesystem-related errors we should throw regardless of whether have have IGNORE_ERRORS set

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. @acquamarin and @mxwli ?

Comment on lines -43 to +42
throw CopyException(stringFormat("Error in file {} on line {}: {}", reader->fileInfo->path,
reader->getLineNumber(), e.what()));
reader->handleCopyException(e.what());
return false;
Copy link
Collaborator Author

@royi-luo royi-luo Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling conversion exceptions using try/catch should have a pretty high performance penalty, keeping it this way in this PR for simplicity though (and it is still manageable in release mode). Just keep in mind that if there are a large number of conversion exceptions and we have backtraces enabled (e.g. in the tests) this will be unbearably slow

@royi-luo royi-luo force-pushed the royi/copy-exception-handling branch 2 times, most recently from 082f89a to d09d4b6 Compare August 16, 2024 18:41
Makefile Outdated Show resolved Hide resolved
Copy link

Benchmark Result

Master commit hash: 700b10f71c4e9ea19b4cd7e57ec864a38033225e
Branch commit hash: 2b8f3ae3924035646f838cd2f660513672151afe

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 676.98 668.53 8.44 (1.26%)
aggregation q28 12347.79 11953.42 394.36 (3.30%)
copy node-Comment 68836.47 68567.14 269.33 (0.39%)
copy node-Forum 5448.92 5609.97 -161.05 (-2.87%)
copy node-Organisation 1536.13 1542.22 -6.09 (-0.39%)
copy node-Person 2946.21 2926.52 19.69 (0.67%)
copy node-Place 1468.69 1462.74 5.95 (0.41%)
copy node-Post 27544.90 27296.52 248.38 (0.91%)
copy node-Tag 1490.63 1626.23 -135.60 (-8.34%)
copy node-Tagclass 1438.37 1400.46 37.91 (2.71%)
copy rel-comment-hasCreator 57838.88 55925.00 1913.88 (3.42%)
copy rel-comment-hasTag 78336.54 77494.42 842.12 (1.09%)
copy rel-comment-isLocatedIn 63643.18 64822.00 -1178.82 (-1.82%)
copy rel-containerOf 18688.84 16836.69 1852.15 (11.00%)
copy rel-forum-hasTag 4161.51 4038.90 122.61 (3.04%)
copy rel-hasInterest 2776.18 2867.67 -91.49 (-3.19%)
copy rel-hasMember 53201.70 53480.62 -278.92 (-0.52%)
copy rel-hasModerator 2092.69 1873.10 219.59 (11.72%)
copy rel-hasType 629.89 719.59 -89.70 (-12.47%)
copy rel-isPartOf 524.56 575.30 -50.74 (-8.82%)
copy rel-isSubclassOf 591.77 387.40 204.37 (52.75%)
copy rel-knows 5902.13 6089.40 -187.27 (-3.08%)
copy rel-likes-comment 85275.16 86746.41 -1471.25 (-1.70%)
copy rel-likes-post 32910.34 34424.23 -1513.89 (-4.40%)
copy rel-organisation-isLocatedIn 409.94 466.05 -56.11 (-12.04%)
copy rel-person-isLocatedIn 831.53 918.55 -87.02 (-9.47%)
copy rel-post-hasCreator 15918.25 18421.75 -2503.50 (-13.59%)
copy rel-post-hasTag 22857.01 22652.77 204.24 (0.90%)
copy rel-post-isLocatedIn 18956.12 18384.76 571.36 (3.11%)
copy rel-replyOf-comment 64974.00 64846.55 127.45 (0.20%)
copy rel-replyOf-post 47606.74 47921.58 -314.84 (-0.66%)
copy rel-studyAt 1103.36 975.53 127.83 (13.10%)
copy rel-workAt 1192.68 1164.92 27.76 (2.38%)
filter q14 155.02 144.67 10.36 (7.16%)
filter q15 163.21 147.65 15.56 (10.54%)
filter q16 331.89 321.50 10.39 (3.23%)
filter q17 473.18 465.22 7.97 (1.71%)
filter q18 1984.27 1993.28 -9.01 (-0.45%)
fixed_size_expr_evaluator q07 571.45 555.62 15.84 (2.85%)
fixed_size_expr_evaluator q08 786.65 777.35 9.30 (1.20%)
fixed_size_expr_evaluator q09 781.98 778.60 3.38 (0.43%)
fixed_size_expr_evaluator q10 265.86 257.71 8.15 (3.16%)
fixed_size_expr_evaluator q11 260.51 251.64 8.87 (3.53%)
fixed_size_expr_evaluator q12 259.91 250.60 9.30 (3.71%)
fixed_size_expr_evaluator q13 1496.75 1484.57 12.18 (0.82%)
fixed_size_seq_scan q23 146.22 133.27 12.95 (9.72%)
join q31 12.51 11.39 1.12 (9.82%)
ldbc_snb_ic q35 748.60 771.29 -22.69 (-2.94%)
ldbc_snb_ic q36 50.37 48.19 2.18 (4.52%)
ldbc_snb_is q32 9.37 8.69 0.69 (7.91%)
ldbc_snb_is q33 18.47 15.30 3.18 (20.75%)
ldbc_snb_is q34 8.19 7.73 0.46 (5.99%)
multi-rel multi-rel-large-scan 3495.36 2881.53 613.83 (21.30%)
multi-rel multi-rel-lookup 51.17 66.20 -15.03 (-22.70%)
multi-rel multi-rel-small-scan 64.00 53.93 10.07 (18.67%)
order_by q25 157.20 148.91 8.29 (5.57%)
order_by q26 497.65 464.11 33.54 (7.23%)
order_by q27 1444.67 1422.50 22.17 (1.56%)
scan_after_filter q01 197.41 192.10 5.31 (2.77%)
scan_after_filter q02 187.37 182.94 4.42 (2.42%)
shortest_path_ldbc100 q39 79.78 97.14 -17.36 (-17.87%)
var_size_expr_evaluator q03 2102.17 2071.60 30.57 (1.48%)
var_size_expr_evaluator q04 2278.20 2263.46 14.74 (0.65%)
var_size_expr_evaluator q05 2648.70 2624.07 24.63 (0.94%)
var_size_expr_evaluator q06 1355.23 1337.83 17.41 (1.30%)
var_size_seq_scan q19 1511.13 1484.97 26.16 (1.76%)
var_size_seq_scan q20 3190.27 3194.70 -4.43 (-0.14%)
var_size_seq_scan q21 2467.78 2430.11 37.67 (1.55%)
var_size_seq_scan q22 135.49 133.10 2.40 (1.80%)

Copy link

Benchmark Result

Master commit hash: 700b10f71c4e9ea19b4cd7e57ec864a38033225e
Branch commit hash: f47ca47c4bcbc00cf69974c623cad874b760eba3

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 671.35 668.53 2.82 (0.42%)
aggregation q28 11750.74 11953.42 -202.69 (-1.70%)
copy node-Comment 72959.93 68567.14 4392.79 (6.41%)
copy node-Forum 5304.09 5609.97 -305.88 (-5.45%)
copy node-Organisation 1575.01 1542.22 32.79 (2.13%)
copy node-Person 2942.77 2926.52 16.25 (0.56%)
copy node-Place 1572.44 1462.74 109.70 (7.50%)
copy node-Post 25490.84 27296.52 -1805.68 (-6.62%)
copy node-Tag 1635.97 1626.23 9.74 (0.60%)
copy node-Tagclass 1468.61 1400.46 68.15 (4.87%)
copy rel-comment-hasCreator 60757.29 55925.00 4832.29 (8.64%)
copy rel-comment-hasTag 77995.23 77494.42 500.81 (0.65%)
copy rel-comment-isLocatedIn 63735.69 64822.00 -1086.31 (-1.68%)
copy rel-containerOf 17618.62 16836.69 781.93 (4.64%)
copy rel-forum-hasTag 3932.63 4038.90 -106.27 (-2.63%)
copy rel-hasInterest 2838.64 2867.67 -29.03 (-1.01%)
copy rel-hasMember 51931.92 53480.62 -1548.70 (-2.90%)
copy rel-hasModerator 2022.87 1873.10 149.77 (8.00%)
copy rel-hasType 767.30 719.59 47.71 (6.63%)
copy rel-isPartOf 330.15 575.30 -245.15 (-42.61%)
copy rel-isSubclassOf 389.49 387.40 2.09 (0.54%)
copy rel-knows 6372.24 6089.40 282.84 (4.64%)
copy rel-likes-comment 87875.65 86746.41 1129.24 (1.30%)
copy rel-likes-post 34652.81 34424.23 228.58 (0.66%)
copy rel-organisation-isLocatedIn 690.90 466.05 224.85 (48.25%)
copy rel-person-isLocatedIn 890.76 918.55 -27.79 (-3.03%)
copy rel-post-hasCreator 18172.13 18421.75 -249.62 (-1.36%)
copy rel-post-hasTag 22304.46 22652.77 -348.31 (-1.54%)
copy rel-post-isLocatedIn 18551.51 18384.76 166.75 (0.91%)
copy rel-replyOf-comment 66183.61 64846.55 1337.06 (2.06%)
copy rel-replyOf-post 49994.66 47921.58 2073.08 (4.33%)
copy rel-studyAt 1041.40 975.53 65.87 (6.75%)
copy rel-workAt 1165.41 1164.92 0.49 (0.04%)
filter q14 142.29 144.67 -2.38 (-1.64%)
filter q15 143.22 147.65 -4.43 (-3.00%)
filter q16 323.68 321.50 2.17 (0.68%)
filter q17 465.95 465.22 0.73 (0.16%)
filter q18 1929.00 1993.28 -64.28 (-3.22%)
fixed_size_expr_evaluator q07 555.19 555.62 -0.43 (-0.08%)
fixed_size_expr_evaluator q08 762.83 777.35 -14.52 (-1.87%)
fixed_size_expr_evaluator q09 774.85 778.60 -3.75 (-0.48%)
fixed_size_expr_evaluator q10 256.78 257.71 -0.93 (-0.36%)
fixed_size_expr_evaluator q11 251.44 251.64 -0.20 (-0.08%)
fixed_size_expr_evaluator q12 250.85 250.60 0.25 (0.10%)
fixed_size_expr_evaluator q13 1485.52 1484.57 0.94 (0.06%)
fixed_size_seq_scan q23 133.02 133.27 -0.25 (-0.19%)
join q31 11.95 11.39 0.56 (4.90%)
ldbc_snb_ic q35 794.57 771.29 23.29 (3.02%)
ldbc_snb_ic q36 47.14 48.19 -1.05 (-2.18%)
ldbc_snb_is q32 9.77 8.69 1.09 (12.49%)
ldbc_snb_is q33 16.79 15.30 1.49 (9.75%)
ldbc_snb_is q34 8.63 7.73 0.90 (11.58%)
multi-rel multi-rel-large-scan 2820.19 2881.53 -61.33 (-2.13%)
multi-rel multi-rel-lookup 67.20 66.20 1.00 (1.50%)
multi-rel multi-rel-small-scan 52.47 53.93 -1.46 (-2.71%)
order_by q25 149.22 148.91 0.31 (0.21%)
order_by q26 467.49 464.11 3.38 (0.73%)
order_by q27 1427.42 1422.50 4.92 (0.35%)
scan_after_filter q01 189.95 192.10 -2.15 (-1.12%)
scan_after_filter q02 180.51 182.94 -2.44 (-1.33%)
shortest_path_ldbc100 q39 91.63 97.14 -5.51 (-5.67%)
var_size_expr_evaluator q03 2057.78 2071.60 -13.83 (-0.67%)
var_size_expr_evaluator q04 2244.59 2263.46 -18.87 (-0.83%)
var_size_expr_evaluator q05 2562.61 2624.07 -61.46 (-2.34%)
var_size_expr_evaluator q06 1353.40 1337.83 15.57 (1.16%)
var_size_seq_scan q19 1483.08 1484.97 -1.89 (-0.13%)
var_size_seq_scan q20 3108.06 3194.70 -86.64 (-2.71%)
var_size_seq_scan q21 2423.00 2430.11 -7.10 (-0.29%)
var_size_seq_scan q22 132.91 133.10 -0.19 (-0.14%)

@royi-luo royi-luo changed the title [DRAFT] Add option to ignore errors in CSV parsing + improve error messages Add option to ignore errors in CSV parsing + improve error messages Aug 17, 2024
@royi-luo royi-luo requested review from ray6080 and acquamarin August 17, 2024 01:06
src/include/processor/execution_context.h Outdated Show resolved Hide resolved
src/processor/execution_context.cpp Outdated Show resolved Hide resolved
src/include/processor/execution_context.h Outdated Show resolved Hide resolved
src/processor/execution_context.cpp Outdated Show resolved Hide resolved
src/processor/operator/physical_operator.cpp Outdated Show resolved Hide resolved
src/processor/operator/aggregate/base_aggregate_scan.cpp Outdated Show resolved Hide resolved
@royi-luo royi-luo force-pushed the royi/copy-exception-handling branch 2 times, most recently from ef3e3ac to 34de000 Compare August 19, 2024 14:59
CMakeLists.txt Outdated Show resolved Hide resolved
Copy link
Contributor

@ray6080 ray6080 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope to see some benchmark numbers on COPY (and LOAD FROM, perhaps LOAD FROM makes more since you're only making changes to csv reader for now):

  • to double check that we don't have performance regression when there are no errors.
  • when error limit is small (default value), there isn't much performance degradation for LOAD FROM.

Let's see some numbers for sanity checks, and then it should be good to go to me.

src/function/table/call/show_tables.cpp Outdated Show resolved Hide resolved
src/include/processor/operator/physical_operator.h Outdated Show resolved Hide resolved
test/test_files/exceptions/copy/invalid_row.test Outdated Show resolved Hide resolved
test/test_files/load_from/load_from.test Outdated Show resolved Hide resolved
@royi-luo
Copy link
Collaborator Author

royi-luo commented Aug 20, 2024

Some quick sanity benchmarks. All were run with the parallel CSV reader using LOAD FROM on a dataset with ~100M nodes and two columns (one INT and STRING):

Load Time

master: 807.13ms
Copy exception branch (no errors): 819.77ms
Copy exception branch (~19M errors, warning limit 8K) 959.60ms
Copy exception branch (~19M errors, warning limit 1M) 5334.42ms

Copy link

Benchmark Result

Master commit hash: 4020110b4c018ac6789bdb31f5af64ee8dc0469f
Branch commit hash: 0697a97a0fe1f36dde6196db37d5e75c67f8b4dc

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 669.97 671.95 -1.97 (-0.29%)
aggregation q28 12152.42 11910.39 242.03 (2.03%)
filter q14 150.63 151.38 -0.74 (-0.49%)
filter q15 151.22 154.79 -3.56 (-2.30%)
filter q16 328.60 328.20 0.39 (0.12%)
filter q17 470.77 476.84 -6.07 (-1.27%)
filter q18 1893.00 1900.48 -7.48 (-0.39%)
fixed_size_expr_evaluator q07 562.60 563.22 -0.62 (-0.11%)
fixed_size_expr_evaluator q08 776.95 775.78 1.17 (0.15%)
fixed_size_expr_evaluator q09 782.15 774.32 7.82 (1.01%)
fixed_size_expr_evaluator q10 265.55 265.41 0.14 (0.05%)
fixed_size_expr_evaluator q11 260.64 259.34 1.30 (0.50%)
fixed_size_expr_evaluator q12 258.67 258.57 0.10 (0.04%)
fixed_size_expr_evaluator q13 1483.56 1484.77 -1.20 (-0.08%)
fixed_size_seq_scan q23 143.86 145.20 -1.34 (-0.92%)
join q31 12.43 12.12 0.31 (2.59%)
ldbc_snb_ic q35 755.16 1037.84 -282.69 (-27.24%)
ldbc_snb_ic q36 46.56 52.55 -5.99 (-11.40%)
ldbc_snb_is q32 10.54 10.35 0.19 (1.88%)
ldbc_snb_is q33 18.54 17.76 0.78 (4.41%)
ldbc_snb_is q34 8.66 7.58 1.08 (14.21%)
multi-rel multi-rel-large-scan 2805.75 2861.49 -55.74 (-1.95%)
multi-rel multi-rel-lookup 40.17 50.94 -10.77 (-21.14%)
multi-rel multi-rel-small-scan 74.40 49.08 25.32 (51.59%)
order_by q25 156.92 157.19 -0.27 (-0.17%)
order_by q26 471.95 478.13 -6.18 (-1.29%)
order_by q27 1425.44 1414.46 10.99 (0.78%)
scan_after_filter q01 197.08 199.24 -2.16 (-1.08%)
scan_after_filter q02 186.04 187.30 -1.26 (-0.67%)
shortest_path_ldbc100 q39 114.30 114.31 -0.01 (-0.01%)
var_size_expr_evaluator q03 2099.88 2086.16 13.71 (0.66%)
var_size_expr_evaluator q04 2245.48 2216.28 29.21 (1.32%)
var_size_expr_evaluator q05 2715.80 2699.20 16.60 (0.61%)
var_size_expr_evaluator q06 1382.82 1375.88 6.94 (0.50%)
var_size_seq_scan q19 1522.42 1474.04 48.39 (3.28%)
var_size_seq_scan q20 3134.10 3167.00 -32.90 (-1.04%)
var_size_seq_scan q21 2430.31 2405.87 24.44 (1.02%)
var_size_seq_scan q22 134.86 130.96 3.90 (2.98%)

src/include/common/string_utils.h Outdated Show resolved Hide resolved
src/include/common/string_utils.h Outdated Show resolved Hide resolved
Copy link
Contributor

@ray6080 ray6080 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment on lines +189 to +192
if (hasBeenFinalized) {
return;
}
hasBeenFinalized = true;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addition of the recursive calls to finalize can cause finalizeInternal() can currently be called multiple times (across pipelines). This hack prevents any issues that this can cause, but we should properly handle it by ensuring finalize() is called at most once per operator

Copy link

Benchmark Result

Master commit hash: 700b10f71c4e9ea19b4cd7e57ec864a38033225e
Branch commit hash: d9f250841110ba569e36a42d1d8937081736742e

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 670.76 668.53 2.23 (0.33%)
aggregation q28 11327.36 11953.42 -626.07 (-5.24%)
filter q14 151.36 144.67 6.69 (4.63%)
filter q15 154.63 147.65 6.98 (4.73%)
filter q16 334.31 321.50 12.80 (3.98%)
filter q17 475.30 465.22 10.09 (2.17%)
filter q18 1967.28 1993.28 -26.00 (-1.30%)
fixed_size_expr_evaluator q07 565.63 555.62 10.01 (1.80%)
fixed_size_expr_evaluator q08 775.05 777.35 -2.30 (-0.30%)
fixed_size_expr_evaluator q09 777.28 778.60 -1.31 (-0.17%)
fixed_size_expr_evaluator q10 264.85 257.71 7.14 (2.77%)
fixed_size_expr_evaluator q11 261.32 251.64 9.68 (3.85%)
fixed_size_expr_evaluator q12 260.90 250.60 10.29 (4.11%)
fixed_size_expr_evaluator q13 1498.37 1484.57 13.80 (0.93%)
fixed_size_seq_scan q23 142.48 133.27 9.21 (6.91%)
join q31 13.49 11.39 2.11 (18.49%)
ldbc_snb_ic q35 784.95 771.29 13.66 (1.77%)
ldbc_snb_ic q36 48.13 48.19 -0.05 (-0.11%)
ldbc_snb_is q32 8.63 8.69 -0.06 (-0.64%)
ldbc_snb_is q33 15.23 15.30 -0.07 (-0.47%)
ldbc_snb_is q34 7.26 7.73 -0.47 (-6.05%)
multi-rel multi-rel-large-scan 4380.25 2881.53 1498.73 (52.01%)
multi-rel multi-rel-lookup 72.20 66.20 5.99 (9.05%)
multi-rel multi-rel-small-scan 60.58 53.93 6.64 (12.32%)
order_by q25 159.01 148.91 10.10 (6.79%)
order_by q26 471.06 464.11 6.95 (1.50%)
order_by q27 1439.85 1422.50 17.35 (1.22%)
scan_after_filter q01 198.66 192.10 6.56 (3.41%)
scan_after_filter q02 185.53 182.94 2.59 (1.41%)
shortest_path_ldbc100 q39 163.12 97.14 65.97 (67.91%)
var_size_expr_evaluator q03 2079.73 2071.60 8.12 (0.39%)
var_size_expr_evaluator q04 2271.61 2263.46 8.15 (0.36%)
var_size_expr_evaluator q05 2648.34 2624.07 24.27 (0.92%)
var_size_expr_evaluator q06 1400.07 1337.83 62.24 (4.65%)
var_size_seq_scan q19 1491.24 1484.97 6.27 (0.42%)
var_size_seq_scan q20 3203.34 3194.70 8.64 (0.27%)
var_size_seq_scan q21 2474.78 2430.11 44.67 (1.84%)
var_size_seq_scan q22 134.74 133.10 1.65 (1.24%)

Copy link

Benchmark Result

Master commit hash: ab89ebdeb5a0c774eb479938f45db2e738c5b73a
Branch commit hash: a514060f7a583e0dbf6a0a522ce4504da4b75ec4

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 663.04 680.31 -17.27 (-2.54%)
aggregation q28 11420.20 11932.65 -512.46 (-4.29%)
filter q14 142.72 161.00 -18.28 (-11.35%)
filter q15 145.34 162.69 -17.35 (-10.66%)
filter q16 322.19 336.55 -14.36 (-4.27%)
filter q17 463.33 483.91 -20.58 (-4.25%)
filter q18 1929.15 2002.90 -73.75 (-3.68%)
fixed_size_expr_evaluator q07 555.87 575.97 -20.10 (-3.49%)
fixed_size_expr_evaluator q08 777.98 787.46 -9.48 (-1.20%)
fixed_size_expr_evaluator q09 767.23 791.09 -23.86 (-3.02%)
fixed_size_expr_evaluator q10 254.85 276.14 -21.30 (-7.71%)
fixed_size_expr_evaluator q11 250.63 269.40 -18.77 (-6.97%)
fixed_size_expr_evaluator q12 250.05 268.70 -18.65 (-6.94%)
fixed_size_expr_evaluator q13 1487.13 1513.98 -26.85 (-1.77%)
fixed_size_seq_scan q23 139.76 152.57 -12.81 (-8.40%)
join q31 13.57 13.21 0.36 (2.73%)
ldbc_snb_ic q35 794.40 789.59 4.82 (0.61%)
ldbc_snb_ic q36 54.39 55.81 -1.42 (-2.55%)
ldbc_snb_is q32 9.73 9.43 0.30 (3.15%)
ldbc_snb_is q33 17.69 17.17 0.53 (3.07%)
ldbc_snb_is q34 8.77 8.61 0.17 (1.92%)
multi-rel multi-rel-large-scan 3917.42 2812.31 1105.11 (39.30%)
multi-rel multi-rel-lookup 59.38 48.72 10.66 (21.89%)
multi-rel multi-rel-small-scan 51.92 49.44 2.48 (5.02%)
order_by q25 149.21 168.01 -18.80 (-11.19%)
order_by q26 468.81 506.19 -37.38 (-7.39%)
order_by q27 1412.06 1442.20 -30.14 (-2.09%)
scan_after_filter q01 195.21 208.66 -13.44 (-6.44%)
scan_after_filter q02 180.23 198.17 -17.94 (-9.05%)
shortest_path_ldbc100 q39 92.74 92.46 0.28 (0.31%)
var_size_expr_evaluator q03 2070.86 2094.34 -23.48 (-1.12%)
var_size_expr_evaluator q04 2271.38 2286.91 -15.54 (-0.68%)
var_size_expr_evaluator q05 2565.44 2603.76 -38.32 (-1.47%)
var_size_expr_evaluator q06 1384.87 1417.10 -32.22 (-2.27%)
var_size_seq_scan q19 1471.39 1504.91 -33.53 (-2.23%)
var_size_seq_scan q20 3125.75 3116.83 8.92 (0.29%)
var_size_seq_scan q21 2386.80 2435.63 -48.84 (-2.01%)
var_size_seq_scan q22 131.24 138.36 -7.12 (-5.15%)

Copy link

Benchmark Result

Master commit hash: ab89ebdeb5a0c774eb479938f45db2e738c5b73a
Branch commit hash: d2f0417305e1ad75d652f754ebb6324c358c31fd

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 664.15 680.31 -16.16 (-2.38%)
aggregation q28 11627.41 11932.65 -305.24 (-2.56%)
filter q14 142.32 161.00 -18.68 (-11.60%)
filter q15 144.41 162.69 -18.28 (-11.24%)
filter q16 319.84 336.55 -16.71 (-4.96%)
filter q17 467.44 483.91 -16.47 (-3.40%)
filter q18 1930.01 2002.90 -72.89 (-3.64%)
fixed_size_expr_evaluator q07 553.85 575.97 -22.12 (-3.84%)
fixed_size_expr_evaluator q08 763.78 787.46 -23.68 (-3.01%)
fixed_size_expr_evaluator q09 764.31 791.09 -26.78 (-3.39%)
fixed_size_expr_evaluator q10 256.92 276.14 -19.22 (-6.96%)
fixed_size_expr_evaluator q11 250.58 269.40 -18.82 (-6.99%)
fixed_size_expr_evaluator q12 251.00 268.70 -17.71 (-6.59%)
fixed_size_expr_evaluator q13 1487.53 1513.98 -26.45 (-1.75%)
fixed_size_seq_scan q23 131.93 152.57 -20.64 (-13.53%)
join q31 13.14 13.21 -0.07 (-0.51%)
ldbc_snb_ic q35 801.93 789.59 12.35 (1.56%)
ldbc_snb_ic q36 47.21 55.81 -8.60 (-15.40%)
ldbc_snb_is q32 9.31 9.43 -0.12 (-1.32%)
ldbc_snb_is q33 17.49 17.17 0.32 (1.87%)
ldbc_snb_is q34 8.14 8.61 -0.47 (-5.42%)
multi-rel multi-rel-large-scan 2936.51 2812.31 124.20 (4.42%)
multi-rel multi-rel-lookup 67.05 48.72 18.33 (37.62%)
multi-rel multi-rel-small-scan 71.83 49.44 22.39 (45.28%)
order_by q25 147.68 168.01 -20.33 (-12.10%)
order_by q26 465.42 506.19 -40.77 (-8.05%)
order_by q27 1410.97 1442.20 -31.23 (-2.17%)
scan_after_filter q01 192.53 208.66 -16.12 (-7.73%)
scan_after_filter q02 180.53 198.17 -17.65 (-8.90%)
shortest_path_ldbc100 q39 93.93 92.46 1.47 (1.59%)
var_size_expr_evaluator q03 2066.33 2094.34 -28.01 (-1.34%)
var_size_expr_evaluator q04 2260.97 2286.91 -25.94 (-1.13%)
var_size_expr_evaluator q05 2556.35 2603.76 -47.41 (-1.82%)
var_size_expr_evaluator q06 1387.30 1417.10 -29.80 (-2.10%)
var_size_seq_scan q19 1467.45 1504.91 -37.46 (-2.49%)
var_size_seq_scan q20 3100.91 3116.83 -15.91 (-0.51%)
var_size_seq_scan q21 2387.98 2435.63 -47.66 (-1.96%)
var_size_seq_scan q22 131.80 138.36 -6.56 (-4.74%)

@royi-luo royi-luo merged commit b841014 into master Aug 22, 2024
23 checks passed
@royi-luo royi-luo deleted the royi/copy-exception-handling branch August 22, 2024 19:18
ted-wq-x pushed a commit to ted-wq-x/kuzu that referenced this pull request Nov 14, 2024
…uzudb#4067)

* Copy exception handling concept

Change handling of copy exceptions to go through CSVErrorHandler

Add ignore errors csv reader option

Update error messages + account for header in line number calculation

Update expected error messages in tests

Print ... if line was not completely parsed when error occurred

Restart state machine if continuing after a parsing error

Set end of line for EOF + more test updates

Add more tests

Report warning in query result

Fix errors after rebase

* Populate warning messages for rel batch insert

* Avoid finalizing in aggregate scan operator

* Add expected warning messages

* Move warnings to factorized table + print in shell

* Limit total number of cached warnings

* Update tests with warnings

* Deal with case where warning table is null

* Avoid stack overflow caused by recursively calling parseCSV + add tests

* Fix failing tests caused by trimming whitespace at end of test checking'

* Add warnings to C API

* Add tests for query result warnings

* Code cleanup

* Update shell printings

* Improve test coverage

* Add shell tests to codecov

* Fix test failures

* Fix windows test failures

* Address review comments

* Run clang-format

* Avoid finalizing children for result collector

* Revert changes to factorized table util as it is no longer needed

* Revert shell test changes

* Remove warnings from query result (to add to CALL() function)

* Add function call show_warnings() to show warnings

* Populate query id in warning table

* Address review comments

* Address review comments 2

* Move warning limit to DB config

* Bump extension version

* Address review comments again

* Fix thread safety issue from progress func getting file offset

* More review comment addressing

* Address review comments

* Fix more error messages

* ResultCollector finalizes its children

* More review comment addressing

* Revert change to test_group.h

* Optimizations - cache errors locally before flushing to shared error handler + move file path from error to error handler

* Fix msvc compile error

* Move some shared fields into common csv error handler

* Hack around double-finalize issue

* Bump extension version

---------

Co-authored-by: CI Bot <royi-luo@users.noreply.github.com>

(cherry picked from commit b841014)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants