
[SHUFFLE] HugePage support in shuffle #855

Closed

FelixYBW opened this issue Apr 16, 2022 · 1 comment

Labels
enhancement New feature or request feature

Comments

@FelixYBW
Collaborator

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

During shuffle, we need to preallocate memory for all the reducers. If we set prefer-spill to false (currently it's true, but it will change to false soon; see issue 840), we need to cache all the split partitions until memory runs out, which can reach GBs of memory. If we still use 4K pages, this leads to a huge number of page faults during allocation and very high DTLB misses during split.

Describe the solution you'd like
Enable large page support (a rough sketch follows the list below):

  1. Allocate one large buffer, 2 MB aligned, for all the split batches
  2. Call madvise(addr, length, MADV_HUGEPAGE) immediately, so the pages are allocated as 2 MB pages
  3. Slice the buffer for each reducer and each column
  4. Split rows into the dedicated column buffers.
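As a rough illustration of steps 1–3 (not the actual gazelle_plugin code; buffer sizes, partition counts, and names here are hypothetical), one could reserve a single anonymous mapping, hint the kernel with MADV_HUGEPAGE, and slice it per reducer:

```cpp
// Minimal sketch of steps 1-3 above; sizes and names are illustrative only.
#include <sys/mman.h>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr size_t kHugePageSize = 2 * 1024 * 1024;  // 2 MB

// Round a size up to a multiple of the huge-page size.
static size_t RoundUpToHugePage(size_t size) {
  return (size + kHugePageSize - 1) / kHugePageSize * kHugePageSize;
}

int main() {
  // 1. Reserve one big anonymous mapping for all split batches. mmap returns a
  //    page-aligned region; rounding the length to 2 MB keeps the whole range
  //    eligible for transparent huge pages.
  size_t total = RoundUpToHugePage(512 * 1024 * 1024);  // hypothetical total size
  void* base = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED) { perror("mmap"); return 1; }

  // 2. Ask the kernel to back the region with 2 MB pages (THP madvise mode).
  if (madvise(base, total, MADV_HUGEPAGE) != 0) perror("madvise");

  // 3. Slice the region into per-reducer buffers (per-column slicing would
  //    subdivide each of these further).
  const int num_reducers = 200;  // hypothetical partition count
  size_t per_reducer = total / num_reducers;
  std::vector<uint8_t*> reducer_buf(num_reducers);
  for (int i = 0; i < num_reducers; ++i) {
    reducer_buf[i] = static_cast<uint8_t*>(base) + i * per_reducer;
  }

  // ... step 4: split rows into the dedicated column buffers ...

  munmap(base, total);
  return 0;
}
```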

The difficulty is variable-length buffer support. Currently we go through the first record batch to get the average length of each variable-length buffer, then preallocate memory as (rows * avg_length + 1K). During the split process we check whether the buffer is large enough and reallocate if not, which usually leads to a memcpy.

With HugePage support we can still use this policy: if the buffer is not large enough, allocate another 4K-page-based buffer. Fortunately, most string buffers in real workloads have a fixed width.

Another possible solution is to stop splitting once any variable-length buffer is full and then allocate the next buffer for splitting. The drawback of this solution is that the remaining space in the other buffers is wasted.
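A rough sketch of the first policy, assuming the initial buffer is a slice of the huge-page region and the fallback is an ordinary allocation (struct and function names here are illustrative, not the real code):

```cpp
// Illustrative sketch of the estimate-then-fallback policy for a
// variable-length (e.g. string) buffer.
#include <cstdint>
#include <cstdlib>
#include <cstring>

struct VarLenBuffer {
  uint8_t* data = nullptr;   // initially a slice of the huge-page region
  size_t capacity = 0;
  size_t used = 0;
  bool owned = false;        // true once we fell back to a regular allocation
};

// Estimate capacity from the first batch: rows * average value length + 1 KB slack.
size_t EstimateCapacity(size_t rows, size_t sample_value_bytes, size_t sample_rows) {
  size_t avg_length = sample_rows == 0 ? 0 : sample_value_bytes / sample_rows;
  return rows * avg_length + 1024;
}

// Append one value; if the buffer is too small, fall back to an ordinary
// (4K-page-backed) buffer, which usually implies a memcpy of existing data.
void Append(VarLenBuffer* buf, const uint8_t* value, size_t length) {
  if (buf->used + length > buf->capacity) {
    size_t new_capacity = (buf->capacity + length) * 2;
    uint8_t* new_data = static_cast<uint8_t*>(std::malloc(new_capacity));
    if (buf->used > 0) std::memcpy(new_data, buf->data, buf->used);
    if (buf->owned) std::free(buf->data);  // never free a huge-page slice here
    buf->data = new_data;
    buf->capacity = new_capacity;
    buf->owned = true;
  }
  std::memcpy(buf->data + buf->used, value, length);
  buf->used += length;
}
```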

@FelixYBW
Collaborator Author

FelixYBW commented May 9, 2022

Arrow doesn't support custom-alignment allocation. If we use jemalloc, 2M-aligned allocation alone doesn't show better performance either. Using the madvise call does decrease DTLB misses and page walking a lot, but the performance gain is small.
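A minimal sketch of the combination described above, assuming the 2 MB-aligned allocation is obtained outside Arrow (here via posix_memalign, since Arrow's allocator cannot be given a custom alignment); this is not the actual patch:

```cpp
// Sketch: 2 MB-aligned allocation plus the madvise hint. Alignment alone does
// not change the page size; the MADV_HUGEPAGE hint is what lets the kernel
// back the range with transparent huge pages.
#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>

int main() {
  constexpr size_t kAlign = 2 * 1024 * 1024;   // 2 MB alignment
  size_t size = 256 * 1024 * 1024;             // hypothetical buffer size
  void* buf = nullptr;
  if (posix_memalign(&buf, kAlign, size) != 0) { perror("posix_memalign"); return 1; }

  if (madvise(buf, size, MADV_HUGEPAGE) != 0) perror("madvise");

  // ... use buf as the shuffle split buffer ...
  free(buf);
  return 0;
}
```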
