Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix](split_by_string) Fix split by string core on column string #28030

Merged
merged 3 commits into from
Dec 7, 2023

Conversation

zhiqiang-hhhh
Copy link
Contributor

@zhiqiang-hhhh zhiqiang-hhhh commented Dec 5, 2023

wait for another PR to merge to 2.0 branch and then merge this PR.

Applying split_by_string on string column, such as select split_by_string ("abc", string) from tbl, will trigger LOG_FATAL on master and make branch-2.0 core dump on mysql result serde.

Since we will have an invalid block like below:

Int         Array
         Data  Offset
1.       a         3
2.       b
3.       c
4

@zhiqiang-hhhh
Copy link
Contributor Author

run buildall

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@zhiqiang-hhhh
Copy link
Contributor Author

run buildall

@zhiqiang-hhhh
Copy link
Contributor Author

run buildall

@zhiqiang-hhhh zhiqiang-hhhh changed the title [fix](split_by_string) Fix split by string wrong result and potential core [fix](split_by_string) Fix split by string core on column string Dec 5, 2023
@wm1581066 wm1581066 added the usercase Important user case type label label Dec 5, 2023
@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 44.38 seconds
stream load tsv: 575 seconds loaded 74807831229 Bytes, about 124 MB/s
stream load json: 18 seconds loaded 2358488459 Bytes, about 124 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17167482211 Bytes

@doris-robot
Copy link

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit 6eba8ca60d7ae43857c9a666dfd4bcc53a188a74, data reload: false

run tpch-sf100 query with default conf and session variables
q1	4700	4434	4441	4434
q2	365	146	158	146
q3	1461	1267	1252	1252
q4	1121	949	928	928
q5	3176	3195	3192	3192
q6	251	127	130	127
q7	1006	504	490	490
q8	2212	2215	2203	2203
q9	6697	6690	6680	6680
q10	3218	3261	3278	3261
q11	324	192	207	192
q12	349	212	212	212
q13	4587	4135	3771	3771
q14	239	203	220	203
q15	576	523	523	523
q16	436	387	381	381
q17	1009	642	619	619
q18	7726	7128	8470	7128
q19	1599	1467	1323	1323
q20	525	536	330	330
q21	3125	2694	2811	2694
q22	355	293	294	293
Total cold run time: 45057 ms
Total hot run time: 40382 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	4374	4389	4369	4369
q2	267	162	169	162
q3	3538	3514	3507	3507
q4	2382	2365	2383	2365
q5	5742	5752	5739	5739
q6	238	122	123	122
q7	2366	1850	1863	1850
q8	3515	3531	3524	3524
q9	9045	9013	9058	9013
q10	3908	3991	3988	3988
q11	495	383	381	381
q12	774	580	595	580
q13	4317	3582	3533	3533
q14	270	256	254	254
q15	580	522	524	522
q16	494	446	447	446
q17	1863	1858	1851	1851
q18	8601	8253	8325	8253
q19	1727	1752	1756	1752
q20	2272	1944	1934	1934
q21	6475	6112	6143	6112
q22	522	437	419	419
Total cold run time: 63765 ms
Total hot run time: 60676 ms

str_pos += delimiter_ref.size;
const size_t new_size = old_size + split_part_size;
column_string_chars.resize(new_size);
if (split_part_size > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if str_ref.length>16

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont know. The snippet is copied from #18496

It seems that there is no guarantee for the memory alignment. I will figure it out.

Copy link
Contributor Author

@zhiqiang-hhhh zhiqiang-hhhh Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are safe here, since ColumnString uses PaddedPODArray as its data type.
using PaddedPODArray = PODArray<T, initial_bytes, TAllocator, 15, 16>; means it will always be safe to read/write 16 bytes after end of StrRef of a ColumnString.

Copy link
Contributor

@BiteTheDDDDt BiteTheDDDDt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

github-actions bot commented Dec 7, 2023

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Dec 7, 2023
Copy link
Contributor

github-actions bot commented Dec 7, 2023

PR approved by anyone and no changes requested.

Copy link
Contributor

@yiguolei yiguolei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yiguolei yiguolei merged commit 1a46cf6 into apache:master Dec 7, 2023
@zhiqiang-hhhh zhiqiang-hhhh deleted the fix-splitbystring branch December 7, 2023 09:12
yiguolei pushed a commit that referenced this pull request Dec 8, 2023
…ysql result (#28069)

BE will core dump if result block is invalid when we doing result serialization.
An existing bug case is described in #28030, so we add check branch to avoid BE core dump due to out of range related problem.
zhiqiang-hhhh added a commit to zhiqiang-hhhh/doris that referenced this pull request Dec 13, 2023
zhiqiang-hhhh added a commit to zhiqiang-hhhh/doris that referenced this pull request Dec 14, 2023
XuJianxu pushed a commit to XuJianxu/doris that referenced this pull request Dec 14, 2023
XuJianxu pushed a commit to XuJianxu/doris that referenced this pull request Dec 14, 2023
…ysql result (apache#28069)

BE will core dump if result block is invalid when we doing result serialization.
An existing bug case is described in apache#28030, so we add check branch to avoid BE core dump due to out of range related problem.
zhiqiang-hhhh added a commit to zhiqiang-hhhh/doris that referenced this pull request Jun 24, 2024
…ysql result (apache#28069)

BE will core dump if result block is invalid when we doing result serialization.
An existing bug case is described in apache#28030, so we add check branch to avoid BE core dump due to out of range related problem.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. dev/2.0.4-merged p0_c reviewed usercase Important user case type label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants