Lightning: Performance Regression on 6.2.0 Compared with 5.3.3 on Parquet Data Source with Strings #38351
/component lightning
I've captured a profile of 6.2.0. The bottleneck lies in converting each string datum from the data source into a normalized string datum. This logic was first introduced in 5.4.0; here is the PR . Originally, string values were not normalized at all. After that PR, each string datum checks the charset and collation in order to normalize the value. By default, when a string column is created in a table without explicit charset or collation settings, the collation is empty and the charset is 'binary'. In that case, when the normalizing logic tries to look up the collation, it finds no match, so it builds an error object with a stack trace and returns it. The upper-level caller catches the error, logs a warning, ignores the error, and falls back to using the raw string as the datum value. Compared to 5.3.3, every string value normalization therefore allocates an extra error object with a stack trace and emits an extra warning log. Surprisingly, the cumulative time spent generating errors with stack traces is huge; that is the bottleneck.
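The error-and-fallback path described above can be sketched as follows. This is a minimal illustration, not the actual TiDB code: the map, type, and function names are hypothetical, and only the shape of the control flow (map miss → error with captured stack → warning → raw-string fallback) is taken from the analysis above.

```go
package main

import (
	"fmt"
	"log"
	"runtime"
)

// collationIDs is a stand-in for TiDB's collation name map (hypothetical).
var collationIDs = map[string]int{
	"utf8mb4_bin": 46,
}

// stackError mimics an error type that captures a stack trace on creation;
// capturing the stack is the expensive part on the per-datum hot path.
type stackError struct {
	msg   string
	stack []uintptr
}

func (e *stackError) Error() string { return e.msg }

func newStackError(msg string) error {
	pcs := make([]uintptr, 32)
	n := runtime.Callers(2, pcs) // far more costly than building a plain error value
	return &stackError{msg: msg, stack: pcs[:n]}
}

// normalize returns the collation-normalized form of s, or an error
// when the collation name is not found (as with the empty collation here).
func normalize(s, collation string) (string, error) {
	if _, ok := collationIDs[collation]; !ok {
		return "", newStackError("unknown collation: " + collation)
	}
	return s, nil // real normalization elided
}

// setStringDatum is the upper-level caller: on error it logs a warning
// and falls back to the raw string, so every datum still pays the
// cost of the error object and the log line.
func setStringDatum(raw, collation string) string {
	v, err := normalize(raw, collation)
	if err != nil {
		log.Printf("warn: %v, falling back to raw string", err)
		return raw
	}
	return v
}

func main() {
	fmt.Println(setStringDatum("hello", ""))            // error path, falls back to raw
	fmt.Println(setStringDatum("hello", "utf8mb4_bin")) // fast path, no error allocated
}
```

Run per imported row, the allocation and stack capture in `newStackError` is what dominates the profile.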
I've taken a deeper look at why the collation settings are incorrect. It appears to be related to the parquet data source: when reading parquet data and setting the string datum, the collation is passed as an empty string.
Then, when the logic reaches the point where the string datum is normalized (Line 199 in accff68),
the empty string finds no entry in the collation name map, so an error with a stack trace is returned. Currently, all other data formats (csv/sql) pass "utf8mb4_bin" as the collation value. So the solution here is to change the parquet path to pass "utf8mb4_bin" as well.
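To make the before/after concrete, here is a hedged sketch of the lookup behavior. The function and map below are hypothetical stand-ins (the real name map lives in TiDB's collation code); the point is simply that the empty string the parquet reader passed always misses the map, while the explicit "utf8mb4_bin" used by the csv/sql readers always hits it.

```go
package main

import "fmt"

// lookupCollationID is a stand-in for the collation-name-map lookup
// performed during normalization (hypothetical names and IDs).
func lookupCollationID(name string) (int, bool) {
	ids := map[string]int{"utf8mb4_bin": 46, "binary": 63}
	id, ok := ids[name]
	return id, ok
}

// collationForParquetDatum models the proposed change: before the fix the
// parquet reader passed "", which always misses the map; after the fix it
// passes "utf8mb4_bin", matching the csv/sql readers.
func collationForParquetDatum(fixed bool) string {
	if fixed {
		return "utf8mb4_bin"
	}
	return ""
}

func main() {
	for _, fixed := range []bool{false, true} {
		name := collationForParquetDatum(fixed)
		_, ok := lookupCollationID(name)
		fmt.Printf("fixed=%v collation=%q found=%v\n", fixed, name, ok)
	}
}
```

With the lookup succeeding, no error object (and no stack trace) is built per datum, which removes the hot spot from the profile.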
After taking a further look at the code: to trigger this performance regression, the charset of some of the target table's string columns must be set to 'binary'.
Please check which versions this bug affects and update the issue tags accordingly.
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
Import a moderate data set into different versions of Lightning: 1) 6.2.0; 2) 5.3.3
The reproduce conditions: the data source is parquet, and the charset of some of the target table's string columns is set to 'binary'.
2. What did you expect to see? (Required)
The import time on the newer version of Lightning should be roughly the same as on the older version.
3. What did you see instead? (Required)
Importing with 6.2.0 is noticeably slower than with 5.3.3.
4. What is your TiDB version? (Required)
6.2.0