
[SHUFFLE] HugePage support in shuffle #855

Closed

FelixYBW opened this issue Apr 16, 2022 · 1 comment

Labels
enhancement New feature or request feature

Comments

@FelixYBW
Collaborator

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

During shuffle, we need to preallocate memory for all the reducers. If we set prefer-spill to false (currently it's true, but it will change to false soon; see issue 840), we need to cache all the split partitions until memory runs out, which can reach GBs of memory. If we still use 4K pages, this leads to a huge number of page faults during allocation and very high DTLB misses during split.

Describe the solution you'd like
Enable large page support (a rough sketch follows the list below):

  1. Allocate one large buffer, 2 MB aligned, for all the split batches
  2. Call madvise(addr, length, MADV_HUGEPAGE) immediately, so the pages are allocated as 2 MB pages
  3. Slice the buffer for each reducer and each column
  4. Split rows into the dedicated column buffers.
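As a rough illustration of steps 1–3 (not the actual gazelle_plugin code; buffer sizes, partition counts, and names here are hypothetical), one could reserve a single anonymous mapping, hint the kernel with MADV_HUGEPAGE, and slice it per reducer:

```cpp
// Minimal sketch of steps 1-3 above; sizes and names are illustrative only.
#include <sys/mman.h>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr size_t kHugePageSize = 2 * 1024 * 1024;  // 2 MB

// Round a size up to a multiple of the huge-page size.
static size_t RoundUpToHugePage(size_t size) {
  return (size + kHugePageSize - 1) / kHugePageSize * kHugePageSize;
}

int main() {
  // 1. Reserve one big anonymous mapping for all split batches. mmap returns a
  //    page-aligned region; rounding the length to 2 MB keeps the whole range
  //    eligible for transparent huge pages.
  size_t total = RoundUpToHugePage(512 * 1024 * 1024);  // hypothetical total size
  void* base = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED) { perror("mmap"); return 1; }

  // 2. Ask the kernel to back the region with 2 MB pages (THP madvise mode).
  if (madvise(base, total, MADV_HUGEPAGE) != 0) perror("madvise");

  // 3. Slice the region into per-reducer buffers (per-column slicing would
  //    subdivide each of these further).
  const int num_reducers = 200;  // hypothetical partition count
  size_t per_reducer = total / num_reducers;
  std::vector<uint8_t*> reducer_buf(num_reducers);
  for (int i = 0; i < num_reducers; ++i) {
    reducer_buf[i] = static_cast<uint8_t*>(base) + i * per_reducer;
  }

  // ... step 4: split rows into the dedicated column buffers ...

  munmap(base, total);
  return 0;
}
```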

The difficulty is variable-length buffer support. Currently we go through the first record batch to get the average length of each variable-length buffer, then preallocate memory as (rows * avg_length + 1K). During the split process we check whether the buffer is large enough and reallocate if not, which usually leads to a memcpy.

With HugePage support we can still use this policy: if the buffer is not large enough, allocate another 4K-page-based buffer. Fortunately, most string buffers in real workloads have a fixed width.

Another possible solution is to stop splitting once any variable-length buffer is full and then allocate the next buffer for splitting. The drawback of this solution is that the remaining space in the other buffers is wasted.
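A rough sketch of the first policy, assuming the initial buffer is a slice of the huge-page region and the fallback is an ordinary allocation (struct and function names here are illustrative, not the real code):

```cpp
// Illustrative sketch of the estimate-then-fallback policy for a
// variable-length (e.g. string) buffer.
#include <cstdint>
#include <cstdlib>
#include <cstring>

struct VarLenBuffer {
  uint8_t* data = nullptr;   // initially a slice of the huge-page region
  size_t capacity = 0;
  size_t used = 0;
  bool owned = false;        // true once we fell back to a regular allocation
};

// Estimate capacity from the first batch: rows * average value length + 1 KB slack.
size_t EstimateCapacity(size_t rows, size_t sample_value_bytes, size_t sample_rows) {
  size_t avg_length = sample_rows == 0 ? 0 : sample_value_bytes / sample_rows;
  return rows * avg_length + 1024;
}

// Append one value; if the buffer is too small, fall back to an ordinary
// (4K-page-backed) buffer, which usually implies a memcpy of existing data.
void Append(VarLenBuffer* buf, const uint8_t* value, size_t length) {
  if (buf->used + length > buf->capacity) {
    size_t new_capacity = (buf->capacity + length) * 2;
    uint8_t* new_data = static_cast<uint8_t*>(std::malloc(new_capacity));
    if (buf->used > 0) std::memcpy(new_data, buf->data, buf->used);
    if (buf->owned) std::free(buf->data);  // never free a huge-page slice here
    buf->data = new_data;
    buf->capacity = new_capacity;
    buf->owned = true;
  }
  std::memcpy(buf->data + buf->used, value, length);
  buf->used += length;
}
```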

@FelixYBW
Collaborator Author

FelixYBW commented May 9, 2022

Arrow doesn't support custom-alignment allocation. If we use jemalloc, 2M-aligned allocation alone doesn't show better performance either. Using the madvise call does decrease DTLB misses and page walking a lot, but the performance gain is small.
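A minimal sketch of the combination described above, assuming the 2 MB-aligned allocation is obtained outside Arrow (here via posix_memalign, since Arrow's allocator cannot be given a custom alignment); this is not the actual patch:

```cpp
// Sketch: 2 MB-aligned allocation plus the madvise hint. Alignment alone does
// not change the page size; the MADV_HUGEPAGE hint is what lets the kernel
// back the range with transparent huge pages.
#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>

int main() {
  constexpr size_t kAlign = 2 * 1024 * 1024;   // 2 MB alignment
  size_t size = 256 * 1024 * 1024;             // hypothetical buffer size
  void* buf = nullptr;
  if (posix_memalign(&buf, kAlign, size) != 0) { perror("posix_memalign"); return 1; }

  if (madvise(buf, size, MADV_HUGEPAGE) != 0) perror("madvise");

  // ... use buf as the shuffle split buffer ...
  free(buf);
  return 0;
}
```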
