[fix](multi table) fix single stream multi table memory leak #38255

Merged on Jul 25, 2024 (2 commits)

@sollhui sollhui commented Jul 23, 2024

We hit OOM when using single-stream multi-table load.
image

There is a memory leak; the heap profile looks like this:
image

The stream load context is not released under some exception conditions, such as when planning fails because high concurrency causes a timeout while obtaining the read lock. The leak was introduced by #35458.

The effect of the fix is shown in the following figure: the load runs stably with a small amount of memory.
image
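
To illustrate the class of bug described above, here is a minimal C++ sketch of a context that stays registered when an error path returns early, and of one common way to guard against that. It is not the code changed in this PR: `LoadContext`, `ContextRegistry`, `ScopeGuard`, `run_load_leaky` and `run_load_fixed` are all hypothetical names, and the `plan_ok` flag merely stands in for a plan that fails, for example on a lock timeout.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a per-load context that owns buffers.
struct LoadContext {
    std::string id;
};

// Hypothetical registry that keeps contexts alive while a load is running.
class ContextRegistry {
public:
    void put(const std::shared_ptr<LoadContext>& ctx) { _map[ctx->id] = ctx; }
    void remove(const std::string& id) { _map.erase(id); }
    std::size_t size() const { return _map.size(); }

private:
    std::map<std::string, std::shared_ptr<LoadContext>> _map;
};

// Minimal scope guard: runs the callback when it goes out of scope,
// whichever return statement is taken.
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> fn) : _fn(std::move(fn)) {}
    ~ScopeGuard() {
        if (_fn) _fn();
    }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> _fn;
};

// Leaky version: the early return on plan failure skips remove(),
// so the registry keeps the shared_ptr and the context is never freed.
bool run_load_leaky(ContextRegistry& reg, const std::string& id, bool plan_ok) {
    auto ctx = std::make_shared<LoadContext>(LoadContext{id});
    reg.put(ctx);
    if (!plan_ok) {
        return false;  // leak: context is still registered
    }
    reg.remove(id);
    return true;
}

// Guarded version: the context is unregistered on every exit path.
bool run_load_fixed(ContextRegistry& reg, const std::string& id, bool plan_ok) {
    auto ctx = std::make_shared<LoadContext>(LoadContext{id});
    reg.put(ctx);
    ScopeGuard guard([&reg, id] { reg.remove(id); });
    if (!plan_ok) {
        return false;  // guard still removes the context
    }
    return true;
}
```

Calling `run_load_leaky` with `plan_ok == false` leaves one entry in the registry, while `run_load_fixed` leaves none on either path. Under high concurrency, each failed plan in the leaky variant would add one more retained context, which matches the steadily growing memory described above.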

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@sollhui

sollhui commented Jul 23, 2024

run buildall

@sollhui sollhui changed the title [fix](multi table) fix single stream multi table memory leak [draft](multi table) fix single stream multi table memory leak Jul 23, 2024
@sollhui sollhui marked this pull request as draft July 23, 2024 09:51

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 39813 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 09efe2f00128ad8ab394fbc3813cd750746b1136, data reload: false

------ Round 1 ----------------------------------
q1	18486	4436	4301	4301
q2	2021	192	192	192
q3	10451	1176	1124	1124
q4	10189	768	880	768
q5	7565	2681	2670	2670
q6	218	135	137	135
q7	953	594	596	594
q8	9219	2045	2074	2045
q9	8586	6515	6520	6515
q10	8758	3762	3780	3762
q11	450	239	236	236
q12	391	229	222	222
q13	17861	2990	2990	2990
q14	271	237	234	234
q15	542	480	500	480
q16	494	391	371	371
q17	967	672	651	651
q18	8132	7547	7455	7455
q19	4935	1421	1347	1347
q20	702	328	316	316
q21	4831	3126	3242	3126
q22	335	279	280	279
Total cold run time: 116357 ms
Total hot run time: 39813 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4341	4255	4251	4251
q2	382	276	267	267
q3	3035	2753	2775	2753
q4	1877	1603	1599	1599
q5	5295	5311	5309	5309
q6	226	127	129	127
q7	2112	1773	1771	1771
q8	3180	3354	3290	3290
q9	8418	8340	8391	8340
q10	3880	3764	3709	3709
q11	584	483	501	483
q12	774	606	603	603
q13	17100	2981	2968	2968
q14	292	276	266	266
q15	508	474	480	474
q16	478	408	410	408
q17	1776	1479	1458	1458
q18	7836	7423	7488	7423
q19	1628	1509	1578	1509
q20	2003	1777	1792	1777
q21	4864	4740	4661	4661
q22	556	492	501	492
Total cold run time: 71145 ms
Total hot run time: 53938 ms

@doris-robot

TPC-DS: Total hot run time: 174015 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 09efe2f00128ad8ab394fbc3813cd750746b1136, data reload: false

query1	917	363	364	363
query2	6436	1892	1867	1867
query3	6668	204	216	204
query4	29101	17561	17230	17230
query5	3901	479	478	478
query6	278	177	168	168
query7	4582	301	301	301
query8	254	201	194	194
query9	8774	2460	2434	2434
query10	437	291	276	276
query11	11288	10217	9969	9969
query12	143	86	84	84
query13	1659	373	397	373
query14	10389	7769	7682	7682
query15	309	173	169	169
query16	8010	476	470	470
query17	1597	568	550	550
query18	2036	290	283	283
query19	203	156	156	156
query20	88	83	82	82
query21	215	130	124	124
query22	4249	4239	4095	4095
query23	33976	33354	33258	33258
query24	10790	2864	2863	2863
query25	645	388	387	387
query26	1394	150	151	150
query27	2854	285	279	279
query28	7195	2064	2052	2052
query29	930	646	628	628
query30	290	153	152	152
query31	980	741	759	741
query32	94	53	59	53
query33	775	360	350	350
query34	911	484	488	484
query35	885	775	726	726
query36	1139	953	919	919
query37	152	82	84	82
query38	2853	2739	2742	2739
query39	878	786	825	786
query40	273	125	122	122
query41	50	46	49	46
query42	129	104	108	104
query43	518	468	479	468
query44	1217	737	740	737
query45	195	165	164	164
query46	1085	741	752	741
query47	1875	1772	1783	1772
query48	384	298	303	298
query49	1191	420	418	418
query50	800	416	397	397
query51	6782	6717	6582	6582
query52	113	94	98	94
query53	354	294	287	287
query54	922	445	447	445
query55	76	73	73	73
query56	291	276	275	275
query57	1194	1084	1026	1026
query58	257	249	270	249
query59	2912	2753	2658	2658
query60	296	286	279	279
query61	99	98	95	95
query62	825	645	654	645
query63	330	295	295	295
query64	9789	2210	6009	2210
query65	3153	3098	3104	3098
query66	933	331	345	331
query67	15731	15083	14937	14937
query68	8228	570	568	568
query69	757	451	357	357
query70	1223	1141	1078	1078
query71	502	282	283	282
query72	8396	5425	5759	5425
query73	770	337	328	328
query74	6221	5630	5636	5630
query75	4895	2674	2690	2674
query76	4949	916	951	916
query77	756	312	322	312
query78	9719	9082	12508	9082
query79	6475	526	527	526
query80	987	475	479	475
query81	587	220	229	220
query82	424	138	132	132
query83	318	216	173	173
query84	281	88	86	86
query85	1259	318	307	307
query86	428	327	324	324
query87	3298	3193	3143	3143
query88	4334	2430	2411	2411
query89	484	374	379	374
query90	2100	195	195	195
query91	134	101	102	101
query92	62	51	52	51
query93	2150	504	508	504
query94	1399	280	304	280
query95	422	317	330	317
query96	597	276	278	276
query97	3215	2980	3005	2980
query98	220	195	203	195
query99	1601	1246	1264	1246
Total cold run time: 297240 ms
Total hot run time: 174015 ms

@doris-robot

ClickBench: Total hot run time: 30.21 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 09efe2f00128ad8ab394fbc3813cd750746b1136, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.04	0.04
query3	0.22	0.06	0.06
query4	1.66	0.10	0.11
query5	0.50	0.48	0.49
query6	1.13	0.72	0.73
query7	0.02	0.01	0.02
query8	0.05	0.04	0.04
query9	0.58	0.49	0.50
query10	0.55	0.55	0.54
query11	0.15	0.11	0.11
query12	0.15	0.12	0.12
query13	0.59	0.58	0.58
query14	0.75	0.79	0.78
query15	0.86	0.80	0.81
query16	0.35	0.35	0.37
query17	0.97	0.98	0.96
query18	0.22	0.22	0.21
query19	1.79	1.68	1.68
query20	0.01	0.00	0.00
query21	15.40	0.77	0.67
query22	4.46	7.61	1.51
query23	18.27	1.43	1.26
query24	2.15	0.24	0.22
query25	0.16	0.09	0.08
query26	0.30	0.21	0.20
query27	0.46	0.23	0.23
query28	13.18	1.03	1.00
query29	12.58	3.31	3.31
query30	0.25	0.06	0.05
query31	2.85	0.40	0.39
query32	3.25	0.47	0.47
query33	2.88	2.93	2.94
query34	16.97	4.38	4.36
query35	4.43	4.37	4.47
query36	0.67	0.46	0.46
query37	0.18	0.16	0.16
query38	0.16	0.14	0.15
query39	0.05	0.04	0.03
query40	0.16	0.13	0.13
query41	0.10	0.05	0.05
query42	0.05	0.05	0.05
query43	0.05	0.05	0.04
Total cold run time: 109.68 s
Total hot run time: 30.21 s

@sollhui sollhui marked this pull request as ready for review July 24, 2024 06:23
@sollhui

sollhui commented Jul 24, 2024

run buildall

@sollhui sollhui changed the title [draft](multi table) fix single stream multi table memory leak [fix](multi table) fix single stream multi table memory leak Jul 24, 2024
@sollhui

sollhui commented Jul 24, 2024

run buildall


clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 39629 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b23b3901295a8bde756c30247c436c65152ce54e, data reload: false

------ Round 1 ----------------------------------
q1	18096	5292	4267	4267
q2	2023	194	197	194
q3	10509	1143	1132	1132
q4	10236	713	761	713
q5	7562	2706	2644	2644
q6	220	136	135	135
q7	957	598	594	594
q8	9219	2065	2108	2065
q9	8555	6527	6538	6527
q10	8782	3770	3758	3758
q11	460	231	238	231
q12	393	221	218	218
q13	17780	2993	2969	2969
q14	283	237	240	237
q15	531	487	486	486
q16	498	383	371	371
q17	974	714	676	676
q18	8112	7514	7302	7302
q19	5757	1335	1351	1335
q20	662	315	326	315
q21	4969	3173	3190	3173
q22	344	287	287	287
Total cold run time: 116922 ms
Total hot run time: 39629 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4383	4320	4234	4234
q2	369	256	260	256
q3	3053	2878	2937	2878
q4	2006	1702	1724	1702
q5	5658	5517	5475	5475
q6	225	132	132	132
q7	2216	1895	1837	1837
q8	3315	3399	3441	3399
q9	8792	8807	9007	8807
q10	4126	3943	3745	3745
q11	604	527	518	518
q12	820	642	659	642
q13	17400	3182	3151	3151
q14	338	277	315	277
q15	525	513	494	494
q16	494	459	432	432
q17	1827	1553	1489	1489
q18	8133	7941	7868	7868
q19	1794	1628	1538	1538
q20	2245	1862	1866	1862
q21	7994	4888	4751	4751
q22	646	532	517	517
Total cold run time: 76963 ms
Total hot run time: 56004 ms

@doris-robot

TPC-DS: Total hot run time: 173558 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b23b3901295a8bde756c30247c436c65152ce54e, data reload: false

query1	914	382	368	368
query2	6440	1893	1862	1862
query3	6654	205	214	205
query4	28353	17640	17276	17276
query5	3770	489	470	470
query6	280	160	163	160
query7	4595	298	301	298
query8	254	203	197	197
query9	8499	2419	2385	2385
query10	435	292	272	272
query11	11639	9979	10017	9979
query12	114	82	88	82
query13	1655	381	379	379
query14	10325	7753	7554	7554
query15	215	166	175	166
query16	7670	474	471	471
query17	1355	542	534	534
query18	1851	277	274	274
query19	194	144	145	144
query20	89	82	81	81
query21	210	129	129	129
query22	4309	4141	3957	3957
query23	34162	34021	33647	33647
query24	10904	2870	2845	2845
query25	597	399	378	378
query26	702	146	150	146
query27	2291	269	279	269
query28	6066	2128	2122	2122
query29	905	646	649	646
query30	252	158	154	154
query31	1006	780	798	780
query32	98	57	53	53
query33	656	334	342	334
query34	952	526	508	508
query35	893	731	750	731
query36	1186	980	978	978
query37	142	82	82	82
query38	2931	2764	2754	2754
query39	894	807	806	806
query40	200	117	115	115
query41	46	43	44	43
query42	119	102	98	98
query43	511	467	463	463
query44	1082	729	719	719
query45	191	155	185	155
query46	1090	722	705	705
query47	1849	1766	1760	1760
query48	357	312	284	284
query49	837	394	404	394
query50	774	379	386	379
query51	6753	6741	6695	6695
query52	105	94	88	88
query53	365	290	288	288
query54	856	446	435	435
query55	75	70	72	70
query56	280	267	269	267
query57	1113	1051	1085	1051
query58	236	251	247	247
query59	2855	2778	2607	2607
query60	284	276	276	276
query61	95	92	93	92
query62	799	650	648	648
query63	318	288	284	284
query64	9134	2286	1707	1707
query65	3185	3105	3135	3105
query66	729	331	330	330
query67	15415	15167	14961	14961
query68	4514	549	557	549
query69	475	327	339	327
query70	1170	1154	1143	1143
query71	378	286	283	283
query72	6932	5956	5446	5446
query73	741	321	323	321
query74	6146	5690	5744	5690
query75	3368	2700	2657	2657
query76	2124	980	925	925
query77	421	306	299	299
query78	9702	8979	9741	8979
query79	2326	515	522	515
query80	2511	469	532	469
query81	599	220	228	220
query82	807	141	135	135
query83	306	164	171	164
query84	252	89	81	81
query85	2122	306	289	289
query86	360	313	330	313
query87	3281	3129	3091	3091
query88	4031	2362	2377	2362
query89	486	380	378	378
query90	1641	190	183	183
query91	127	100	98	98
query92	60	49	49	49
query93	2550	514	512	512
query94	708	288	286	286
query95	417	327	312	312
query96	609	269	266	266
query97	3232	3017	3019	3017
query98	229	199	189	189
query99	1533	1268	1276	1268
Total cold run time: 278268 ms
Total hot run time: 173558 ms

@doris-robot

ClickBench: Total hot run time: 30.79 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit b23b3901295a8bde756c30247c436c65152ce54e, data reload: false

query1	0.04	0.04	0.04
query2	0.08	0.04	0.04
query3	0.23	0.05	0.05
query4	1.68	0.08	0.08
query5	0.47	0.47	0.47
query6	1.12	0.73	0.72
query7	0.02	0.01	0.01
query8	0.05	0.05	0.04
query9	0.55	0.49	0.48
query10	0.55	0.54	0.51
query11	0.17	0.12	0.11
query12	0.14	0.12	0.13
query13	0.61	0.64	0.59
query14	0.76	0.79	0.78
query15	0.86	0.83	0.82
query16	0.35	0.37	0.38
query17	0.95	0.99	1.01
query18	0.24	0.23	0.22
query19	1.85	1.75	1.87
query20	0.02	0.01	0.01
query21	15.46	0.79	0.65
query22	3.85	7.60	1.93
query23	18.34	1.36	1.28
query24	2.06	0.23	0.24
query25	0.14	0.08	0.08
query26	0.29	0.22	0.21
query27	0.45	0.23	0.24
query28	13.27	1.04	1.01
query29	12.64	3.35	3.32
query30	0.25	0.06	0.06
query31	2.87	0.40	0.41
query32	3.25	0.50	0.49
query33	2.91	2.98	2.94
query34	17.19	4.30	4.39
query35	4.41	4.42	4.40
query36	0.67	0.50	0.47
query37	0.20	0.16	0.15
query38	0.16	0.15	0.15
query39	0.04	0.04	0.04
query40	0.16	0.12	0.13
query41	0.09	0.05	0.05
query42	0.07	0.04	0.05
query43	0.05	0.04	0.04
Total cold run time: 109.56 s
Total hot run time: 30.79 s


@liaoxin01 liaoxin01 left a comment

LGTM


PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 24, 2024

PR approved by anyone and no changes requested.


@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit afc5593 into apache:master Jul 25, 2024
27 of 30 checks passed
dataroaring pushed a commit that referenced this pull request Jul 30, 2024
dataroaring pushed a commit that referenced this pull request Aug 4, 2024
…#38824)

pick (#38255)

Labels
approved Indicates a PR has been approved by one committer. dev/2.1.6-merged dev/3.0.1-merged reviewed

4 participants