Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix](cooldown)No need to rdlock inside get_cooldown_tablets, there's enough rdlock inside tablet internal function calls. #39211

Merged
merged 1 commit into from
Aug 30, 2024

Conversation

biohazard4321
Copy link
Contributor

Proposed changes

There's a recurse rdlock inside get_cooldown;when using libcxx,libcxx‘s shared_mutex dosen't support recursive lock, so causes hangs below:
#0 0x00007f788b101a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00005636b8ff996f in std::__1::condition_variable::wait(std::__1::unique_lockstd::__1::mutex&) ()
#2 0x00005636b8ffad8b in std::__1::__shared_mutex_base::lock_shared() ()
#3 0x00005636addc524c in doris::Tablet::_has_data_to_cooldown() ()
#4 0x00005636addc353b in doris::Tablet::pick_cooldown_rowset() ()
#5 0x00005636addc56b9 in doris::Tablet::need_cooldown(long*, unsigned long*) ()
#6 0x00005636adde52f7 in doris::TabletManager::get_cooldown_tablets(std::__1::vector<std::__1::shared_ptrdoris::Tablet, std::__1::allocator<std::__1::shared_ptrdoris::Tablet > >, std::__1::vector<std::__1::shared_ptrdoris::Rowset, std::__1::allocator<std::__1::shared_ptrdoris::Rowset > >, std::__1::function<bool (std::__1::shared_ptrdoris::Tablet const&)>) ()
#7 0x00005636adb0ed81 in doris::StorageEngine::_cooldown_tasks_producer_callback() ()
#8 0x00005636ae17a253 in doris::Thread::supervise_thread(void*) ()
#9 0x00007f788b0fdea5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f788b916b0d in clone () from /lib64/libc.so.6

review the code, there's no need a rdlock inside TabletManager::get_cooldown_tablets

…rdlock inside tablet internal function calls.
@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@github-actions github-actions bot added the doing label Aug 12, 2024
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@biohazard4321
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 39925 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 06fcc6153de744289eab30a30102738685cb293b, data reload: false

------ Round 1 ----------------------------------
q1	18003	4563	4535	4535
q2	2692	189	191	189
q3	11810	1197	1105	1105
q4	10549	749	727	727
q5	7667	2604	2539	2539
q6	231	141	142	141
q7	990	616	607	607
q8	9573	1964	1936	1936
q9	8845	6583	6539	6539
q10	7019	2222	2217	2217
q11	468	248	253	248
q12	395	224	228	224
q13	17762	2981	2990	2981
q14	284	238	237	237
q15	534	484	491	484
q16	511	386	395	386
q17	1019	741	733	733
q18	8079	7497	7297	7297
q19	2895	1004	1066	1004
q20	683	315	353	315
q21	5895	4470	4539	4470
q22	1094	1013	1011	1011
Total cold run time: 116998 ms
Total hot run time: 39925 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4439	4254	4279	4254
q2	376	272	278	272
q3	2870	2663	2593	2593
q4	1940	1630	1623	1623
q5	5258	5283	5284	5283
q6	226	132	131	131
q7	2051	1638	1705	1638
q8	3197	3337	3350	3337
q9	8479	8421	8425	8421
q10	3370	3169	3160	3160
q11	597	512	484	484
q12	769	611	571	571
q13	16506	3001	2987	2987
q14	293	272	266	266
q15	536	481	480	480
q16	475	436	421	421
q17	1794	1483	1518	1483
q18	7685	7475	7323	7323
q19	1707	1611	1560	1560
q20	1992	1794	1772	1772
q21	5340	5033	5148	5033
q22	1080	994	987	987
Total cold run time: 70980 ms
Total hot run time: 54079 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 201114 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 06fcc6153de744289eab30a30102738685cb293b, data reload: false

query1	926	383	364	364
query2	6472	1955	1895	1895
query3	6661	210	215	210
query4	31272	23257	23132	23132
query5	4230	501	502	501
query6	282	163	163	163
query7	4587	297	291	291
query8	239	218	198	198
query9	8809	2478	2471	2471
query10	567	506	444	444
query11	16583	15035	14906	14906
query12	164	98	96	96
query13	1638	407	363	363
query14	10375	6754	7606	6754
query15	276	216	202	202
query16	7463	507	525	507
query17	1644	579	550	550
query18	1405	295	290	290
query19	189	148	148	148
query20	112	105	107	105
query21	207	107	101	101
query22	4395	3917	3905	3905
query23	34142	33041	33058	33041
query24	12167	2664	2541	2541
query25	680	385	410	385
query26	1704	156	155	155
query27	2851	284	290	284
query28	7518	2042	2032	2032
query29	1067	435	427	427
query30	314	148	148	148
query31	987	754	764	754
query32	95	57	57	57
query33	760	295	292	292
query34	925	465	475	465
query35	956	814	816	814
query36	1108	950	899	899
query37	279	81	87	81
query38	4290	4181	4360	4181
query39	1452	1356	1389	1356
query40	278	117	118	117
query41	51	47	49	47
query42	122	98	101	98
query43	504	472	483	472
query44	1229	739	726	726
query45	231	204	199	199
query46	1097	749	722	722
query47	1835	1772	1755	1755
query48	375	301	298	298
query49	1193	437	511	437
query50	797	423	420	420
query51	6848	6649	6550	6550
query52	103	93	86	86
query53	262	185	182	182
query54	965	451	436	436
query55	76	78	76	76
query56	283	238	240	238
query57	1172	1106	1086	1086
query58	228	222	230	222
query59	2964	2741	2851	2741
query60	286	269	261	261
query61	95	110	94	94
query62	814	651	644	644
query63	216	188	182	182
query64	10466	2264	1794	1794
query65	3191	3146	3144	3144
query66	1252	350	325	325
query67	15551	15325	14883	14883
query68	8554	557	560	557
query69	476	422	413	413
query70	1141	1090	1112	1090
query71	553	279	274	274
query72	20261	17295	17598	17295
query73	1433	331	329	329
query74	9853	8740	8695	8695
query75	4812	2694	2728	2694
query76	4917	996	992	992
query77	778	315	306	306
query78	9610	10119	8972	8972
query79	9881	530	528	528
query80	1009	508	512	508
query81	603	251	231	231
query82	648	150	139	139
query83	335	149	150	149
query84	262	81	78	78
query85	1007	278	282	278
query86	314	265	303	265
query87	4844	4532	4541	4532
query88	4330	2495	2510	2495
query89	524	290	287	287
query90	1826	204	201	201
query91	125	114	98	98
query92	61	51	51	51
query93	7302	551	531	531
query94	636	307	303	303
query95	407	260	261	260
query96	618	280	274	274
query97	3225	3084	3079	3079
query98	224	206	199	199
query99	1744	1290	1268	1268
Total cold run time: 334588 ms
Total hot run time: 201114 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 30.84 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 06fcc6153de744289eab30a30102738685cb293b, data reload: false

query1	0.04	0.04	0.04
query2	0.08	0.04	0.04
query3	0.22	0.05	0.06
query4	1.68	0.08	0.07
query5	0.51	0.50	0.49
query6	1.13	0.73	0.72
query7	0.02	0.02	0.01
query8	0.05	0.04	0.05
query9	0.55	0.51	0.50
query10	0.54	0.56	0.55
query11	0.16	0.12	0.12
query12	0.15	0.12	0.13
query13	0.60	0.60	0.58
query14	0.76	0.78	0.80
query15	0.86	0.83	0.81
query16	0.38	0.37	0.38
query17	1.04	0.96	1.05
query18	0.24	0.22	0.23
query19	1.79	1.76	1.73
query20	0.01	0.01	0.01
query21	15.42	0.74	0.64
query22	4.53	6.82	2.17
query23	18.22	1.43	1.25
query24	2.11	0.23	0.22
query25	0.15	0.08	0.08
query26	0.29	0.21	0.22
query27	0.46	0.24	0.22
query28	13.26	1.03	1.00
query29	12.62	3.29	3.25
query30	0.24	0.05	0.05
query31	2.89	0.41	0.39
query32	3.23	0.49	0.47
query33	2.94	2.91	2.95
query34	16.99	4.36	4.32
query35	4.42	4.41	4.37
query36	0.67	0.48	0.47
query37	0.19	0.16	0.15
query38	0.16	0.16	0.16
query39	0.04	0.04	0.04
query40	0.16	0.12	0.14
query41	0.10	0.05	0.05
query42	0.07	0.05	0.05
query43	0.04	0.04	0.04
Total cold run time: 110.01 s
Total hot run time: 30.84 s

@biohazard4321
Copy link
Contributor Author

run beut

@biohazard4321
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 39706 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 06fcc6153de744289eab30a30102738685cb293b, data reload: false

------ Round 1 ----------------------------------
q1	17951	4525	4407	4407
q2	2633	185	183	183
q3	11760	1177	1093	1093
q4	10204	801	769	769
q5	7550	2522	2536	2522
q6	233	145	145	145
q7	999	606	608	606
q8	9298	1892	1943	1892
q9	8694	6506	6533	6506
q10	6987	2186	2230	2186
q11	493	248	255	248
q12	402	225	221	221
q13	17765	3020	2963	2963
q14	284	236	238	236
q15	530	475	501	475
q16	487	406	387	387
q17	969	682	695	682
q18	8151	7538	7413	7413
q19	4554	1019	1041	1019
q20	722	317	341	317
q21	5341	4437	4475	4437
q22	1076	1014	999	999
Total cold run time: 117083 ms
Total hot run time: 39706 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4500	4282	4309	4282
q2	379	284	276	276
q3	2854	2639	2607	2607
q4	1901	1692	1616	1616
q5	5272	5247	5311	5247
q6	227	130	133	130
q7	2067	1725	1662	1662
q8	3135	3323	3372	3323
q9	8394	8364	8375	8364
q10	3365	3130	3149	3130
q11	590	491	487	487
q12	804	592	577	577
q13	16461	2985	3001	2985
q14	316	281	276	276
q15	527	479	478	478
q16	472	427	415	415
q17	1791	1487	1469	1469
q18	7561	7491	7359	7359
q19	1683	1505	1555	1505
q20	1981	1809	1766	1766
q21	5304	5161	5105	5105
q22	1097	950	1008	950
Total cold run time: 70681 ms
Total hot run time: 54009 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 201213 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 06fcc6153de744289eab30a30102738685cb293b, data reload: false

query1	922	366	357	357
query2	6449	1891	1779	1779
query3	6652	213	219	213
query4	33961	23185	23060	23060
query5	4200	495	492	492
query6	291	186	187	186
query7	4589	300	299	299
query8	242	211	200	200
query9	8561	2494	2496	2494
query10	539	462	469	462
query11	16157	14946	14909	14909
query12	146	96	92	92
query13	1633	370	359	359
query14	10182	7635	7327	7327
query15	283	209	217	209
query16	7977	476	468	468
query17	1732	555	556	555
query18	2022	294	269	269
query19	191	143	139	139
query20	115	102	102	102
query21	204	98	95	95
query22	4280	4151	3957	3957
query23	33702	33284	33262	33262
query24	10934	2584	2483	2483
query25	600	361	407	361
query26	1586	158	152	152
query27	2924	287	283	283
query28	7588	2044	2026	2026
query29	960	403	409	403
query30	303	145	146	145
query31	967	728	783	728
query32	101	56	56	56
query33	748	275	279	275
query34	980	465	479	465
query35	960	844	816	816
query36	1061	926	951	926
query37	149	83	80	80
query38	4235	4139	4141	4139
query39	1421	1398	1399	1398
query40	284	113	113	113
query41	46	46	45	45
query42	108	98	97	97
query43	499	474	473	473
query44	1168	722	735	722
query45	231	207	213	207
query46	1084	721	768	721
query47	1831	1766	1736	1736
query48	376	309	330	309
query49	1162	408	415	408
query50	799	409	424	409
query51	6800	6690	6645	6645
query52	107	91	97	91
query53	255	184	174	174
query54	912	447	443	443
query55	74	73	77	73
query56	263	238	242	238
query57	1158	1055	1078	1055
query58	239	245	230	230
query59	2951	2798	2910	2798
query60	293	259	258	258
query61	100	96	99	96
query62	842	642	650	642
query63	213	218	176	176
query64	10470	2269	1737	1737
query65	3193	3134	3178	3134
query66	1052	347	337	337
query67	15131	14842	14738	14738
query68	4645	557	544	544
query69	397	361	399	361
query70	1164	1121	1154	1121
query71	412	271	275	271
query72	19150	17393	17235	17235
query73	750	330	330	330
query74	9147	8673	8701	8673
query75	3372	2690	2626	2626
query76	2827	912	951	912
query77	483	300	307	300
query78	9364	9061	9008	9008
query79	1989	516	526	516
query80	1245	493	500	493
query81	582	222	220	220
query82	810	136	140	136
query83	215	161	154	154
query84	241	82	76	76
query85	1994	276	272	272
query86	500	318	306	306
query87	4611	4526	4664	4526
query88	3947	2523	2575	2523
query89	397	286	292	286
query90	1831	208	199	199
query91	122	94	98	94
query92	67	53	48	48
query93	2361	539	533	533
query94	895	271	292	271
query95	350	268	272	268
query96	615	281	283	281
query97	3236	3060	3067	3060
query98	218	202	198	198
query99	1681	1246	1261	1246
Total cold run time: 311806 ms
Total hot run time: 201213 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 31.34 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 06fcc6153de744289eab30a30102738685cb293b, data reload: false

query1	0.04	0.04	0.04
query2	0.08	0.04	0.04
query3	0.23	0.05	0.05
query4	1.66	0.08	0.08
query5	0.49	0.49	0.48
query6	1.13	0.72	0.72
query7	0.02	0.01	0.02
query8	0.05	0.04	0.04
query9	0.54	0.50	0.48
query10	0.54	0.55	0.55
query11	0.15	0.12	0.11
query12	0.15	0.12	0.12
query13	0.60	0.60	0.58
query14	0.77	0.78	0.78
query15	0.87	0.81	0.82
query16	0.37	0.38	0.37
query17	0.99	0.96	1.06
query18	0.23	0.21	0.21
query19	1.83	1.72	1.69
query20	0.01	0.01	0.01
query21	15.41	0.77	0.66
query22	3.73	5.76	2.66
query23	18.29	1.44	1.35
query24	2.18	0.24	0.21
query25	0.15	0.09	0.08
query26	0.31	0.22	0.21
query27	0.46	0.22	0.23
query28	13.25	1.00	0.99
query29	12.58	3.33	3.30
query30	0.23	0.04	0.04
query31	2.88	0.39	0.38
query32	3.27	0.48	0.47
query33	2.91	2.94	2.89
query34	17.27	4.38	4.34
query35	4.44	4.40	4.37
query36	0.64	0.46	0.46
query37	0.19	0.15	0.16
query38	0.16	0.14	0.15
query39	0.05	0.04	0.04
query40	0.16	0.12	0.13
query41	0.10	0.05	0.05
query42	0.05	0.04	0.04
query43	0.04	0.04	0.04
Total cold run time: 109.5 s
Total hot run time: 31.34 s

@biohazard4321
Copy link
Contributor Author

run cloud_p1

@biohazard4321
Copy link
Contributor Author

run external

Copy link
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Aug 30, 2024
Copy link
Contributor

PR approved by at least one committer and no changes requested.

Copy link
Contributor

PR approved by anyone and no changes requested.

@gavinchou gavinchou merged commit 263faea into apache:master Aug 30, 2024
29 of 31 checks passed
dataroaring pushed a commit that referenced this pull request Oct 9, 2024
… enough rdlock inside tablet internal function calls. (#39211)

## Proposed changes
There's a recurse rdlock inside get_cooldown;when using libcxx,libcxx‘s
shared_mutex dosen't support recursive lock, so causes hangs below:
#0 0x00007f788b101a35 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1 0x00005636b8ff996f in
std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
()
#2 0x00005636b8ffad8b in std::__1::__shared_mutex_base::lock_shared() ()
#3  0x00005636addc524c in doris::Tablet::_has_data_to_cooldown() ()
#4  0x00005636addc353b in doris::Tablet::pick_cooldown_rowset() ()
#5 0x00005636addc56b9 in doris::Tablet::need_cooldown(long*, unsigned
long*) ()
#6 0x00005636adde52f7 in
doris::TabletManager::get_cooldown_tablets(std::__1::vector<std::__1::shared_ptr<doris::Tablet>,
std::__1::allocator<std::__1::shared_ptr<doris::Tablet> > >*,
std::__1::vector<std::__1::shared_ptr<doris::Rowset>,
std::__1::allocator<std::__1::shared_ptr<doris::Rowset> > >*,
std::__1::function<bool (std::__1::shared_ptr<doris::Tablet> const&)>)
()
#7 0x00005636adb0ed81 in
doris::StorageEngine::_cooldown_tasks_producer_callback() ()
#8  0x00005636ae17a253 in doris::Thread::supervise_thread(void*) ()
#9  0x00007f788b0fdea5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f788b916b0d in clone () from /lib64/libc.so.6

review the code, there's no need a rdlock inside
TabletManager::get_cooldown_tablets
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. dev/3.0.3-merged doing reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants