[fix](group commit) make group commit cancel in time #36249

Merged (1 commit) on Jun 13, 2024

@mymeiyi mymeiyi commented Jun 13, 2024

Proposed changes

If the group commit time interval is larger than the load timeout, and no new client load arrives to reuse the internal group commit load, the group commit cannot be canceled in time because it is stuck in a wait:

```
#0  0x00007f33937a47aa in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00005651105dbd05 in __gthread_cond_timedwait(pthread_cond_t*, pthread_mutex_t*, timespec const*) ()
#2  0x000056511063f385 in std::__condvar::wait_until(std::mutex&, timespec&) ()
#3  0x000056511063dc2e in std::cv_status std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::unique_lock<std::mutex>&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&) ()
#4  0x000056511063cedf in std::cv_status std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::unique_lock<std::mutex>&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&) ()
#5  0x0000565110824f48 in std::cv_status std::condition_variable::wait_for<long, std::ratio<1l, 1000l> >(std::unique_lock<std::mutex>&, std::chrono::duration<long, std::ratio<1l, 1000l> > const&) ()
#6  0x0000565113b5612a in doris::LoadBlockQueue::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*, bool*) ()
#7  0x000056513f900941 in doris::pipeline::GroupCommitOperatorX::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*) ()
#8  0x000056513c69c0b6 in doris::pipeline::ScanOperatorX<doris::pipeline::GroupCommitLocalState>::get_block_after_projects(doris::RuntimeState*, doris::vectorized::Block*, bool*) ()
#9  0x000056514009d5f1 in doris::pipeline::PipelineTask::execute(bool*) ()
#10 0x00005651400fb24a in doris::pipeline::TaskScheduler::_do_work(unsigned long) ()
```
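
For readers unfamiliar with this failure mode, here is a minimal sketch of one common way to keep a long `condition_variable` wait responsive to cancellation: wait in short slices and re-check a cancel flag on every wake-up. This is an illustration only and is not taken from the patch. `SlicedWaitQueue`, its `Result` enum, and the `slice` parameter are hypothetical names, and the real `LoadBlockQueue::get_block` may address the problem differently.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

// Hypothetical stand-in for a block queue; NOT Doris's LoadBlockQueue.
class SlicedWaitQueue {
public:
    enum class Result { kGotItem, kCancelled, kIntervalElapsed };

    void push(int item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(item);
        }
        cv_.notify_one();
    }

    // Waits up to `interval` for an item, but never sleeps longer than
    // `slice` at a time, so a cancel request is observed within about
    // one slice instead of after the whole interval.
    Result pop(int* out, std::chrono::milliseconds interval,
               std::chrono::milliseconds slice,
               const std::atomic<bool>& cancelled) {
        assert(slice.count() > 0);
        const auto deadline = std::chrono::steady_clock::now() + interval;
        std::unique_lock<std::mutex> lock(mutex_);
        while (true) {
            // Cancellation takes priority over everything else.
            if (cancelled.load()) {
                return Result::kCancelled;
            }
            if (!items_.empty()) {
                *out = items_.front();
                items_.pop_front();
                return Result::kGotItem;
            }
            const auto now = std::chrono::steady_clock::now();
            if (now >= deadline) {
                return Result::kIntervalElapsed;
            }
            // Sleep until the next slice boundary or the deadline,
            // whichever comes first; spurious wake-ups just loop again.
            cv_.wait_until(lock, std::min(deadline, now + slice));
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<int> items_;
};
```

The trade-off in this sketch is a few extra wake-ups per interval in exchange for a bounded cancellation delay. Having the canceling thread notify the condition variable directly would avoid the polling, at the cost of the canceler needing access to the queue.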

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

mymeiyi commented Jun 13, 2024

run buildall


clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.51% (9013/24684)
Line Coverage: 28.09% (73927/263179)
Region Coverage: 27.55% (38383/139320)
Branch Coverage: 24.23% (19547/80676)
Coverage Report: http://coverage.selectdb-in.cc/coverage/691f825b50b5b96c358c5157a0cdf622900e3504_691f825b50b5b96c358c5157a0cdf622900e3504/report/index.html

@doris-robot

TPC-H: Total hot run time: 40354 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 691f825b50b5b96c358c5157a0cdf622900e3504, data reload: false

------ Round 1 ----------------------------------
q1	17757	4354	4233	4233
q2	2022	193	204	193
q3	10450	1189	1215	1189
q4	10216	762	757	757
q5	7553	2692	2651	2651
q6	221	135	136	135
q7	965	623	602	602
q8	9227	2098	2067	2067
q9	8961	6592	6505	6505
q10	9026	3765	3777	3765
q11	469	239	235	235
q12	518	233	230	230
q13	18888	3006	3017	3006
q14	268	226	228	226
q15	511	482	485	482
q16	529	395	377	377
q17	972	634	721	634
q18	8428	7964	7847	7847
q19	11171	1613	1383	1383
q20	649	333	321	321
q21	5076	3180	3893	3180
q22	404	349	336	336
Total cold run time: 124281 ms
Total hot run time: 40354 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4497	4238	4471	4238
q2	393	280	275	275
q3	3123	2929	2996	2929
q4	1989	1691	1633	1633
q5	5459	5519	5541	5519
q6	231	132	132	132
q7	2179	1814	1866	1814
q8	3334	3407	3425	3407
q9	8802	8742	8755	8742
q10	4026	3821	3873	3821
q11	602	515	492	492
q12	795	630	637	630
q13	15938	3166	3193	3166
q14	296	273	285	273
q15	523	479	481	479
q16	510	430	444	430
q17	1804	1523	1497	1497
q18	8045	7844	7785	7785
q19	3002	1721	1661	1661
q20	2975	1869	1870	1869
q21	7124	4729	4925	4729
q22	658	526	602	526
Total cold run time: 76305 ms
Total hot run time: 56047 ms

@doris-robot

TPC-DS: Total hot run time: 173785 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 691f825b50b5b96c358c5157a0cdf622900e3504, data reload: false

query1	949	381	377	377
query2	6431	2406	2207	2207
query3	6626	208	210	208
query4	22692	17491	17314	17314
query5	3612	479	465	465
query6	240	158	164	158
query7	4597	296	301	296
query8	316	283	276	276
query9	8643	2474	2479	2474
query10	592	294	283	283
query11	10714	10086	10049	10049
query12	121	96	92	92
query13	1642	375	371	371
query14	10189	6929	7904	6929
query15	252	208	192	192
query16	7992	307	293	293
query17	1803	557	546	546
query18	2081	292	282	282
query19	203	160	156	156
query20	90	84	81	81
query21	208	128	130	128
query22	4619	4503	4514	4503
query23	34124	33737	33461	33461
query24	10862	2887	2793	2793
query25	599	359	355	355
query26	1112	150	150	150
query27	2407	321	330	321
query28	7034	2107	2086	2086
query29	883	634	633	633
query30	227	152	150	150
query31	968	731	727	727
query32	91	50	53	50
query33	740	273	274	273
query34	904	466	458	458
query35	737	621	597	597
query36	1091	942	929	929
query37	162	66	68	66
query38	2885	2743	2724	2724
query39	858	796	801	796
query40	221	130	124	124
query41	52	51	49	49
query42	114	92	99	92
query43	595	554	511	511
query44	1220	718	719	718
query45	197	161	161	161
query46	1088	713	696	696
query47	1851	1801	1802	1801
query48	375	291	288	288
query49	851	390	416	390
query50	778	389	387	387
query51	6789	6690	6650	6650
query52	109	93	91	91
query53	351	291	281	281
query54	878	463	441	441
query55	75	73	71	71
query56	278	278	271	271
query57	1149	1079	1126	1079
query58	259	239	242	239
query59	3356	3129	3089	3089
query60	286	268	271	268
query61	89	89	105	89
query62	619	474	458	458
query63	312	286	288	286
query64	8829	2218	1774	1774
query65	3175	3103	3079	3079
query66	745	330	329	329
query67	15399	15099	14871	14871
query68	4528	541	526	526
query69	572	460	362	362
query70	1179	1127	1156	1127
query71	405	272	266	266
query72	7166	6022	5487	5487
query73	764	329	325	325
query74	5904	5595	5519	5519
query75	3431	2655	2676	2655
query76	2815	957	947	947
query77	620	284	290	284
query78	10388	9738	9912	9738
query79	2355	508	513	508
query80	941	460	452	452
query81	588	220	217	217
query82	687	100	103	100
query83	254	169	221	169
query84	240	89	83	83
query85	1680	269	260	260
query86	470	316	310	310
query87	3299	3105	3094	3094
query88	4077	2461	2443	2443
query89	480	374	382	374
query90	1749	192	182	182
query91	145	98	100	98
query92	56	51	51	51
query93	2418	508	500	500
query94	1129	189	185	185
query95	403	314	305	305
query96	589	273	271	271
query97	3264	3069	3076	3069
query98	242	205	193	193
query99	1391	853	876	853
Total cold run time: 274565 ms
Total hot run time: 173785 ms

@doris-robot

ClickBench: Total hot run time: 30.38 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 691f825b50b5b96c358c5157a0cdf622900e3504, data reload: false

query1	0.04	0.03	0.04
query2	0.07	0.04	0.04
query3	0.23	0.04	0.05
query4	1.68	0.06	0.07
query5	0.49	0.49	0.50
query6	1.13	0.72	0.72
query7	0.02	0.01	0.01
query8	0.05	0.04	0.04
query9	0.53	0.49	0.49
query10	0.53	0.54	0.54
query11	0.14	0.12	0.11
query12	0.15	0.12	0.12
query13	0.60	0.59	0.60
query14	0.75	0.80	0.76
query15	0.83	0.82	0.82
query16	0.37	0.36	0.35
query17	1.03	1.02	1.00
query18	0.21	0.27	0.25
query19	1.79	1.72	1.82
query20	0.02	0.01	0.01
query21	15.41	0.67	0.66
query22	4.72	7.14	1.68
query23	18.31	1.33	1.21
query24	2.01	0.24	0.22
query25	0.14	0.09	0.10
query26	0.28	0.17	0.17
query27	0.09	0.08	0.08
query28	13.20	1.03	1.00
query29	12.70	3.26	3.26
query30	0.25	0.06	0.06
query31	2.85	0.38	0.38
query32	3.29	0.48	0.46
query33	2.91	2.93	2.95
query34	17.09	4.47	4.44
query35	4.54	4.49	4.56
query36	0.65	0.46	0.46
query37	0.19	0.15	0.15
query38	0.15	0.14	0.15
query39	0.04	0.04	0.04
query40	0.16	0.14	0.15
query41	0.10	0.05	0.05
query42	0.06	0.04	0.06
query43	0.05	0.04	0.04
Total cold run time: 109.85 s
Total hot run time: 30.38 s

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the `approved` label (indicates a PR has been approved by one committer) Jun 13, 2024

PR approved by at least one committer and no changes requested.


PR approved by anyone and no changes requested.

@dataroaring dataroaring merged commit 975beea into apache:master Jun 13, 2024
27 of 31 checks passed
dataroaring pushed a commit that referenced this pull request Jun 17, 2024
mymeiyi added a commit to mymeiyi/doris that referenced this pull request Jul 7, 2024
Labels
approved (indicates a PR has been approved by one committer), dev/2.1.5-merged, dev/3.0.0-merged, reviewed
4 participants