
[fix](multi-table-load) fix be core when multi table load pipe finish fail (#36269) #37455


@sollhui sollhui commented Jul 8, 2024

pick (#36269)

Proposed changes

*** Current BE git commitID: 5a8ea3079d ***
*** SIGSEGV address not mapped to object (@0x18) received by PID 3726857 (TID 3727585 OR 0x7f0129e83700) from PID 24; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /mnt/disk2/xujianxu/doris/be/src/common/signal_handler.h:421
 1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
 2# JVM_handle_linux_signal in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
 3# 0x00007F01D9E87090 in /lib/x86_64-linux-gnu/libc.so.6
 4# std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_State_baseV2::_Setter<doris::Status, doris::Status const&> >::_M_invoke(std::_Any_data const&) at /mnt/disk2/xujianxu/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/std_function.h:290
 5# std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) at /mnt/disk2/xujianxu/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/future:593
 6# __pthread_once_slow at /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_once.c:118
 7# std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) at /mnt/disk2/xujianxu/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/future:428
 8# doris::io::MultiTablePipe::_handle_consumer_finished() at /mnt/disk2/xujianxu/doris/be/src/io/fs/multi_table_pipe.cpp:334
 9# doris::io::MultiTablePipe::exec_plans<doris::TPipelineFragmentParams>(doris::ExecEnv*, std::vector<doris::TPipelineFragmentParams, std::allocator<doris::TPipelineFragmentParams> >)::{lambda(doris::RuntimeState*, doris::Status*)#1}::operator()(doris::RuntimeState*, doris::Status*) const at /mnt/disk2/xujianxu/doris/be/src/io/fs/multi_table_pipe.cpp:253
10# doris::pipeline::PipelineFragmentContext::~PipelineFragmentContext() at /mnt/disk2/xujianxu/doris/be/src/pipeline/pipeline_fragment_context.cpp:131
11# std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release_last_use_cold() at /mnt/disk2/xujianxu/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/shared_ptr_base.h:199
12# doris::pipeline::_close_task(doris::pipeline::PipelineTask*, doris::Status) at /mnt/disk2/xujianxu/doris/be/src/pipeline/task_scheduler.cpp:95
13# doris::pipeline::TaskScheduler::_do_work(unsigned long) at /mnt/disk2/xujianxu/doris/be/src/pipeline/task_scheduler.cpp:168
14# doris::ThreadPool::dispatch_thread() in /mnt/hdd01/STRESS_ENV/be/lib/doris_be
15# doris::Thread::supervise_thread(void*) at /mnt/disk2/xujianxu/doris/be/src/util/thread.cpp:499
16# start_thread at /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:478
17# __clone at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 

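The BE core dumps when finishing the multi-table load pipe fails. exec_task returns as soon as the finish fails, which lets ctx be destructed while the callback shown in the stack trace above can still run against it. Waiting for all tables to finish before returning solves the problem.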
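
As an illustration only, the "wait for all tables to finish" idea can be sketched in standalone C++. This is not the Doris patch: `LoadContext`, `TablePipe`, `on_plan_finished`, `wait_all_finished` and `run_failed_load` are hypothetical names, and the sketch assumes each table plan reports completion exactly once through a callback that touches the shared context.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-load context that callbacks write to.
struct LoadContext {
    int finished_plans = 0;
};

// Hypothetical stand-in for the multi-table pipe: counts in-flight plans.
class TablePipe {
public:
    TablePipe(LoadContext& ctx, int inflight) : _ctx(ctx), _inflight(inflight) {}

    // Called once per table plan when that plan is torn down.
    void on_plan_finished() {
        std::lock_guard<std::mutex> lock(_mu);
        ++_ctx.finished_plans;  // touches the context, so it must still be alive
        if (--_inflight == 0) _cv.notify_all();
    }

    // Blocks until every table plan has reported completion.
    void wait_all_finished() {
        std::unique_lock<std::mutex> lock(_mu);
        _cv.wait(lock, [this] { return _inflight == 0; });
    }

private:
    LoadContext& _ctx;
    std::mutex _mu;
    std::condition_variable _cv;
    int _inflight;
};

// Simulates a load whose finish step failed. Instead of returning at once,
// the caller waits for all table plans before the context goes out of scope.
int run_failed_load(int num_tables) {
    LoadContext ctx;
    TablePipe pipe(ctx, num_tables);
    std::vector<std::thread> workers;
    for (int i = 0; i < num_tables; ++i) {
        workers.emplace_back([&pipe] { pipe.on_plan_finished(); });
    }
    pipe.wait_all_finished();  // an early return would skip this wait
    for (auto& t : workers) t.join();
    return ctx.finished_plans;
}
```

Calling `run_failed_load(4)` returns only after all four simulated plans have run their callback, so the context is never released underneath them.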

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.


sollhui commented Jul 8, 2024

run buildall


github-actions bot commented Jul 8, 2024

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 49663 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 063353d44cb1f7a425e352a4be663204d53da90e, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17960	4364	4323	4323
q2	2037	158	151	151
q3	10456	1894	1892	1892
q4	10291	1231	1343	1231
q5	8438	3896	3903	3896
q6	232	126	126	126
q7	2035	1590	1616	1590
q8	9298	2720	2698	2698
q9	10409	10382	10162	10162
q10	8610	3449	3449	3449
q11	429	238	253	238
q12	470	294	308	294
q13	18355	3998	4026	3998
q14	358	333	322	322
q15	514	460	466	460
q16	666	572	571	571
q17	1124	975	965	965
q18	7339	6828	7029	6828
q19	1780	1645	1629	1629
q20	543	313	298	298
q21	4430	4138	4094	4094
q22	515	448	450	448
Total cold run time: 116289 ms
Total hot run time: 49663 ms
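
The Round 1 totals are consistent with summing the first timing column for the cold total and, for the hot total, the last column, which equals the smaller of the two middle columns in each row. A minimal C++ sketch of that arithmetic follows; `QueryTiming`, `Totals` and `summarize` are hypothetical names, not part of the Doris benchmark tools.

```cpp
#include <algorithm>
#include <vector>

// One benchmark row: a cold run followed by two hot runs, in milliseconds.
struct QueryTiming {
    long cold_ms;
    long hot1_ms;
    long hot2_ms;
};

struct Totals {
    long cold_ms;
    long hot_ms;
};

// Sums the cold runs, and the faster of the two hot runs for each query.
Totals summarize(const std::vector<QueryTiming>& rows) {
    Totals t{0, 0};
    for (const auto& r : rows) {
        t.cold_ms += r.cold_ms;
        t.hot_ms += std::min(r.hot1_ms, r.hot2_ms);
    }
    return t;
}
```

Feeding it the q1 to q3 rows above gives a cold total of 30453 ms and a hot total of 6366 ms.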

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	4334	4364	4306	4306
q2	318	230	225	225
q3	4177	4166	4147	4147
q4	2757	2739	2736	2736
q5	7150	7122	7080	7080
q6	235	119	121	119
q7	3253	2845	2856	2845
q8	4391	4448	4495	4448
q9	16926	16672	16734	16672
q10	4302	4324	4289	4289
q11	773	708	689	689
q12	1028	863	873	863
q13	8749	3768	3764	3764
q14	455	441	425	425
q15	508	472	457	457
q16	738	692	700	692
q17	3893	3921	3868	3868
q18	9356	8871	8868	8868
q19	1741	1716	1714	1714
q20	2400	2137	2129	2129
q21	8582	8525	8780	8525
q22	1139	1103	1040	1040
Total cold run time: 87205 ms
Total hot run time: 79901 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.90% (8116/21417)
Line Coverage: 29.56% (66495/224921)
Region Coverage: 29.03% (34275/118053)
Branch Coverage: 24.91% (17606/70692)
Coverage Report: http://coverage.selectdb-in.cc/coverage/063353d44cb1f7a425e352a4be663204d53da90e_063353d44cb1f7a425e352a4be663204d53da90e/report/index.html

@doris-robot

TPC-DS: Total hot run time: 203826 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 063353d44cb1f7a425e352a4be663204d53da90e, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query1	1940	424	391	391
query2	8043	2858	2702	2702
query3	8161	215	208	208
query4	21371	18071	18108	18071
query5	19731	6504	6511	6504
query6	418	215	231	215
query7	5166	297	311	297
query8	461	420	399	399
query9	3064	2649	2597	2597
query10	457	313	299	299
query11	11351	10683	10642	10642
query12	126	82	77	77
query13	5615	692	684	684
query14	19321	13090	13518	13090
query15	365	268	252	252
query16	6425	280	262	262
query17	1364	1509	879	879
query18	2267	415	433	415
query19	222	153	150	150
query20	80	80	82	80
query21	191	97	91	91
query22	5171	5043	5022	5022
query23	32481	31944	31954	31944
query24	6762	6533	6474	6474
query25	520	434	432	432
query26	534	161	165	161
query27	1746	299	297	297
query28	6168	2336	2313	2313
query29	2866	2689	2660	2660
query30	241	164	167	164
query31	919	733	766	733
query32	70	67	58	58
query33	408	253	251	251
query34	853	478	487	478
query35	1135	900	920	900
query36	1311	1287	1266	1266
query37	94	59	59	59
query38	3044	2958	2941	2941
query39	1389	1337	1316	1316
query40	213	97	97	97
query41	47	45	45	45
query42	89	86	80	80
query43	778	636	664	636
query44	1138	704	714	704
query45	242	241	233	233
query46	1220	962	966	962
query47	1884	1715	1833	1715
query48	1005	722	713	713
query49	624	365	378	365
query50	863	643	613	613
query51	4734	4634	4692	4634
query52	89	84	81	81
query53	448	328	322	322
query54	2625	2475	2435	2435
query55	86	91	82	82
query56	248	224	193	193
query57	1295	1173	1156	1156
query58	223	210	199	199
query59	4394	4030	4187	4030
query60	218	211	201	201
query61	98	93	93	93
query62	776	441	472	441
query63	489	334	341	334
query64	2498	1588	1492	1492
query65	3628	3584	3550	3550
query66	773	382	376	376
query67	15863	16677	15585	15585
query68	8592	650	659	650
query69	580	353	347	347
query70	1637	1462	1407	1407
query71	422	319	317	317
query72	6531	3452	3492	3452
query73	727	326	316	316
query74	6335	5869	5887	5869
query75	4713	3748	3739	3739
query76	4872	1092	1108	1092
query77	679	259	251	251
query78	12501	11605	13064	11605
query79	7907	616	627	616
query80	1387	406	408	406
query81	505	237	232	232
query82	683	101	96	96
query83	175	136	129	129
query84	264	74	68	68
query85	1213	333	330	330
query86	342	300	333	300
query87	3267	3059	3053	3053
query88	4824	2270	2278	2270
query89	353	300	325	300
query90	1932	208	206	206
query91	177	140	142	140
query92	53	51	55	51
query93	3061	602	541	541
query94	765	201	212	201
query95	1112	1064	1032	1032
query96	701	329	317	317
query97	6488	6405	6268	6268
query98	187	172	179	172
query99	2898	912	946	912
Total cold run time: 314322 ms
Total hot run time: 203826 ms

@doris-robot

ClickBench: Total hot run time: 30.23 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 063353d44cb1f7a425e352a4be663204d53da90e, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.02	0.02	0.02
query2	0.07	0.02	0.02
query3	0.25	0.04	0.05
query4	1.79	0.07	0.06
query5	0.53	0.52	0.52
query6	1.24	0.61	0.62
query7	0.01	0.01	0.02
query8	0.04	0.03	0.02
query9	0.52	0.48	0.49
query10	0.54	0.54	0.53
query11	0.12	0.09	0.08
query12	0.12	0.08	0.09
query13	0.62	0.62	0.61
query14	0.79	0.78	0.79
query15	0.78	0.75	0.77
query16	0.35	0.38	0.37
query17	1.00	1.01	1.02
query18	0.22	0.26	0.24
query19	1.94	1.90	1.85
query20	0.01	0.00	0.00
query21	15.48	0.56	0.57
query22	2.01	2.38	1.51
query23	17.54	1.10	0.83
query24	7.55	1.11	0.66
query25	0.38	0.06	0.08
query26	0.82	0.18	0.15
query27	0.05	0.03	0.03
query28	5.85	0.80	0.72
query29	12.68	2.68	2.32
query30	0.55	0.53	0.49
query31	2.81	0.39	0.37
query32	3.40	0.49	0.52
query33	3.08	3.07	3.12
query34	15.27	4.81	4.79
query35	4.86	4.83	4.86
query36	1.05	1.03	1.02
query37	0.06	0.04	0.05
query38	0.03	0.02	0.02
query39	0.02	0.01	0.02
query40	0.16	0.15	0.14
query41	0.07	0.02	0.01
query42	0.03	0.01	0.02
query43	0.02	0.02	0.02
Total cold run time: 104.73 s
Total hot run time: 30.23 s

@doris-robot

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit 063353d44cb1f7a425e352a4be663204d53da90e with default session variables
Stream load json:         20 seconds loaded 2358488459 Bytes, about 112 MB/s
Stream load orc:          59 seconds loaded 1101869774 Bytes, about 17 MB/s
Stream load parquet:      31 seconds loaded 861443392 Bytes, about 26 MB/s
Insert into select:       21.3 seconds inserted 10000000 Rows, about 469K ops/s

@dataroaring dataroaring merged commit aefc6be into apache:branch-2.0 Jul 8, 2024
23 of 25 checks passed
mongo360 pushed a commit to mongo360/doris that referenced this pull request Aug 16, 2024