
Pool Allocator

Matt Norman edited this page Aug 15, 2022 · 18 revisions
   ,(   ,(   ,(   ,(   ,(   ,(   ,(   ,(
`-'  `-'  `-'  `-'  `-'  `-'  `-'  `-'  `
   _________________________
 / "Don't be a malloc-hater  \
|   Use the pool alligator!"  |
 \     _____________________ / 
  |  /
  |/       .-._   _ _ _ _ _ _ _ _
.-''-.__.-'00  '-' ' ' ' ' ' ' ' '-.
'.___ '    .   .--_'-' '-' '-' _'-' '._
 V: V 'vv-'   '_   '.       .'  _..' '.'.
   '=.____.=_.--'   :_.__.__:_   '.   : :
           (((____.-'        '-.  /   : :
                             (((-'\ .' /
                           _____..'  .'
                          '-._____.-'
   ,(   ,(   ,(   ,(   ,(   ,(   ,(   ,(
`-'  `-'  `-'  `-'  `-'  `-'  `-'  `-'  `

YAKL has a pool allocator, "Gator", that is automatically turned on and used as long as the hardware backend device has a separate memory space. The reason for the pool is that allocation and free calls on accelerator devices are typically very expensive, and scientific codes often allocate and free memory very frequently. To do this efficiently, a large pool of memory is allocated at YAKL's initialization, and YAKL hands out chunks of the pool during runtime very cheaply.

The thing about a pool allocator is that once you run out of memory in a given pool, you cannot resize the pool, because that would invalidate the pointers you've already handed out from it. Rather, you can only add new pools. Therefore, if the arrays you're allocating are "large" and the size of individual pools is "small", you may find yourself in situations where no additional pool is large enough to hold a given array. In those cases, YAKL will inform you that your initial pool size is too small.
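The "add pools, never resize" behavior described above can be sketched with a small bookkeeping model. This is only an illustration under stated assumptions: `PoolModel`, `MultiPoolModel`, and `AllocResult` are hypothetical names, not Gator's actual classes, and the model tracks byte counts only (no real memory, no frees, no fragmentation).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bookkeeping-only model of a set of pools. It tracks byte counts and never
// touches real memory. All names here are hypothetical, not YAKL's API.
struct PoolModel {
    std::size_t capacity;  // total bytes in this pool (fixed for its lifetime)
    std::size_t used;      // bytes handed out so far (frees are not modeled)
};

enum class AllocResult { FitExisting, FitNewPool, TooLargeForAnyPool };

class MultiPoolModel {
public:
    MultiPoolModel(std::size_t initial_bytes, std::size_t grow_bytes)
        : grow_bytes_(grow_bytes) {
        assert(initial_bytes > 0 && grow_bytes > 0);
        pools_.push_back(PoolModel{initial_bytes, 0});
    }

    AllocResult allocate(std::size_t bytes) {
        // Existing pools are never resized; we only look for leftover room.
        for (PoolModel &p : pools_) {
            if (p.capacity - p.used >= bytes) {
                p.used += bytes;
                return AllocResult::FitExisting;
            }
        }
        // A request bigger than the growth size cannot fit in any future pool.
        if (bytes > grow_bytes_) return AllocResult::TooLargeForAnyPool;
        pools_.push_back(PoolModel{grow_bytes_, bytes});
        return AllocResult::FitNewPool;
    }

    std::size_t num_pools() const { return pools_.size(); }

private:
    std::size_t grow_bytes_;
    std::vector<PoolModel> pools_;
};
```

With an initial pool of 100 bytes and a growth size of 50 bytes, a 60-byte request made after the initial pool is nearly full returns `TooLargeForAnyPool`, which is the situation where a larger initial pool is the remedy.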

You control the behavior of Gator's pool management through the following environment variables:

  • GATOR_INITIAL_MB: The initial pool size in MB
  • GATOR_GROW_MB: The size of each new pool in MB once the initial pool is out of memory
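As a rough sketch of how an application might read these two variables itself (for logging or sanity checks), the snippet below uses `std::getenv`. The helper names `mb_from_env` and `read_pool_config`, and the fallback values of 1024 MB and "same as the initial size", are assumptions made for illustration; they are not taken from YAKL and may not match its defaults.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>

// Parse a positive size in MB from an environment variable.
// Returns fallback_mb if the variable is unset, empty, non-numeric, or zero.
// Values too large for the integer types are not handled in this sketch.
std::size_t mb_from_env(const char *name, std::size_t fallback_mb) {
    assert(name != nullptr);
    const char *raw = std::getenv(name);
    if (raw == nullptr || !std::isdigit(static_cast<unsigned char>(raw[0]))) {
        return fallback_mb;
    }
    char *end = nullptr;
    unsigned long long value = std::strtoull(raw, &end, 10);
    if (*end != '\0' || value == 0) return fallback_mb;  // trailing junk or zero
    return static_cast<std::size_t>(value);
}

struct PoolConfig {
    std::size_t initial_bytes;
    std::size_t grow_bytes;
};

// Hypothetical fallbacks: 1024 MB initial, growth size equal to the initial size.
PoolConfig read_pool_config() {
    const std::size_t MB = 1024ull * 1024ull;
    std::size_t initial_mb = mb_from_env("GATOR_INITIAL_MB", 1024);
    std::size_t grow_mb = mb_from_env("GATOR_GROW_MB", initial_mb);
    return PoolConfig{initial_mb * MB, grow_mb * MB};
}
```

In a job script, the variables would be set before launching the executable, for example by exporting GATOR_INITIAL_MB=4096 in the shell.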

YAKL's pool allocator is pretty informative and will try to let you know what to do if an issue occurs. Some features of Gator:

  • Fortran bindings for integer, integer(8), real, real(8), and logical
  • Fortran bindings for arrays of one to seven dimensions
  • Able to call cudaMallocManaged under the hood with prefetching and memset
  • Able to support arbitrary lower bounds in the Fortran interface for Fortran pointers
  • Simple pool allocator implementation that automatically grows as needed
  • The pool allocator responds to environment variables to control the initial allocation size, and the size of each additional pool as it grows
  • Minimal internal fragmentation for any pattern of allocations and frees
  • Warns the user if allocations are left allocated after the pool is destroyed
  • Thread safe, so feel free to use the pool inside CPU-threaded regions. Gator uses std::mutex to lock and unlock, so it is thread safe for pthreads, std::thread, and OpenMP CPU threads.
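To illustrate the `std::mutex` locking pattern mentioned in the last bullet, and the idea of detecting allocations that are still live at teardown, here is a minimal sketch. `LockedUsageTracker` is a hypothetical class written for this page; it only counts allocations and is not how Gator is implemented internally.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical usage tracker: every public method takes the same mutex, so
// calls from several CPU threads are serialized and the counters stay consistent.
class LockedUsageTracker {
public:
    void on_allocate(std::size_t bytes) {
        std::lock_guard<std::mutex> guard(mtx_);
        bytes_in_use_ += bytes;
        ++live_allocations_;
    }

    void on_free(std::size_t bytes) {
        std::lock_guard<std::mutex> guard(mtx_);
        assert(live_allocations_ > 0 && bytes_in_use_ >= bytes);
        bytes_in_use_ -= bytes;
        --live_allocations_;
    }

    // A nonzero value at teardown means something was never freed.
    std::size_t live_allocations() const {
        std::lock_guard<std::mutex> guard(mtx_);
        return live_allocations_;
    }

    std::size_t bytes_in_use() const {
        std::lock_guard<std::mutex> guard(mtx_);
        return bytes_in_use_;
    }

private:
    mutable std::mutex mtx_;  // mutable so const getters can lock it
    std::size_t bytes_in_use_ = 0;
    std::size_t live_allocations_ = 0;
};
```

Because the lock is an ordinary `std::mutex`, the same pattern works whether the calling threads come from pthreads, `std::thread`, or OpenMP CPU threads.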

The pool search and allocation algorithm is not the fastest, but it is as close to optimal in terms of memory usage and fragmentation as you can get. The cost is typically fine because the cost of allocating data is overlapped with GPU kernel execution in most contexts. Regardless, the cost is still significantly less than most accelerator device calls to malloc and free.

How to determine what to do in case of an error

The heuristics below offer a simple way to manage errors by changing only GATOR_INITIAL_MB:

  1. If the error message says you cannot fit the variable in the current pools or in future pools, then you should increase GATOR_INITIAL_MB.
     • GATOR_INITIAL_MB must be larger than the size of the allocation given to you in the error message.
  2. If you've run out of device memory and only one pool has been created, then you should decrease GATOR_INITIAL_MB. The initial pool is requesting more memory than you have available.
     • In particular, ensure GATOR_INITIAL_MB is not larger than the total amount of memory available on the device.
  3. If you've run out of device memory and more than one pool has been created, then you should increase GATOR_INITIAL_MB. This creates more room for the variables you could not fit.
  4. If none of this has helped, you are simply using more memory than the device has available.
     • If there are other modules in your code using GPU memory, consider deallocating it before running the current module.
     • Consider using more nodes to decrease the amount of memory required per GPU.
     • Consider executing one of your dimensions in "chunks" if that dimension is trivially parallel to reduce the overall memory requirements.
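The heuristics above amount to a small decision table, sketched below as a function. The enum and function names are hypothetical and exist only to make the logic explicit; YAKL does not expose such an API, and the error messages themselves remain the authoritative guide.

```cpp
#include <cassert>

// Hypothetical labels for the two failure modes discussed in the heuristics.
enum class Symptom {
    TooLargeForAnyPool,  // allocation cannot fit in current or future pools
    OutOfDeviceMemory    // creating a pool failed for lack of device memory
};

enum class Advice { IncreaseInitialMB, DecreaseInitialMB, ReduceMemoryFootprint };

// pools_created: how many pools existed when the error occurred (at least 1).
// tuning_already_failed: true if adjusting GATOR_INITIAL_MB was already tried
// without success.
Advice advise(Symptom symptom, int pools_created, bool tuning_already_failed) {
    assert(pools_created >= 1);
    // Nothing left to tune: the code needs more memory than the device has.
    if (tuning_already_failed) return Advice::ReduceMemoryFootprint;
    // The allocation is bigger than any pool could be: make the initial pool larger.
    if (symptom == Symptom::TooLargeForAnyPool) return Advice::IncreaseInitialMB;
    // Out of memory with a single pool: the initial pool itself is too big.
    if (pools_created == 1) return Advice::DecreaseInitialMB;
    // Out of memory after growing: a larger initial pool leaves more usable room.
    return Advice::IncreaseInitialMB;
}
```

`ReduceMemoryFootprint` stands for the last-resort options listed above: freeing memory held by other modules, using more nodes, or processing a dimension in chunks.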

You can tell when pools are created by specifying -DYAKL_VERBOSE_FILE, which will dump out a verbose file per process recording each internal event inside YAKL as it occurs, including pool creation. grep for "Creating pool of" in that file to determine how many pools have been created for a given MPI task.
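If grep is not convenient (for example, when post-processing many per-process files from within a tool), the same count can be obtained with a few lines of C++. This is a sketch: the function names are hypothetical, the file path is whatever your run produced, and the only thing assumed about the file's contents is that pool creation lines contain the phrase "Creating pool of".

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <stdexcept>
#include <string>

// Count the lines of a text stream that contain `needle`.
std::size_t count_matching_lines(std::istream &in,
                                 const std::string &needle = "Creating pool of") {
    assert(!needle.empty());
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(needle) != std::string::npos) ++count;
    }
    return count;
}

// Convenience wrapper for one verbose file; the path is supplied by the caller.
std::size_t count_pool_creations_in_file(const std::string &path) {
    std::ifstream in(path);
    if (!in) throw std::runtime_error("cannot open " + path);
    return count_matching_lines(in);
}
```

A count of 1 means only the initial pool was created; anything higher means the pool grew during the run.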

Common Runtime Error Messages:

Your array is too large to fit in the existing pools or any added pools:

  • There isn't enough room in the existing pool for your variable, and the variable allocation size is larger than GATOR_GROW_MB.
  • Increasing GATOR_INITIAL_MB is your best bet. But, of course, do not set GATOR_INITIAL_MB to more memory than the GPU has available. Often a GPU that advertises, say 16GB, has a bit less than that available to you, so reduce the requested size a bit compared to the advertised memory limit.

You've run out of memory

  • You've requested an allocation that cannot fit in existing pools, but adding a new pool failed because there isn't enough memory for it.
  • It's possible you're using your memory inefficiently because individual allocations are large compared to the size of the pool.
  • Again, your best bet here is to increase GATOR_INITIAL_MB given the caveats about available GPU memory mentioned above.
  • If increasing GATOR_INITIAL_MB does not work, then you should consider increasing your node count to decrease the per-GPU memory requirements.
  • Another option is to process one of your dimensions in smaller "chunks" to reduce the memory required at any given time.
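The chunking idea can be sketched independently of YAKL: split the index range of a trivially parallel dimension into consecutive sub-ranges and process them one at a time, so that temporary arrays only need to be sized for one chunk. The helper names below are hypothetical, and the per-index byte count is something you would estimate for your own code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split [0, n) into consecutive half-open ranges [begin, end) of at most
// chunk_size indices each. The last range may be shorter.
std::vector<std::pair<std::size_t, std::size_t>>
make_chunks(std::size_t n, std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t begin = 0; begin < n; begin += chunk_size) {
        ranges.emplace_back(begin, std::min(begin + chunk_size, n));
    }
    return ranges;
}

// Temporary storage needed when only one chunk is resident at a time,
// given an estimate of the bytes required per index of the chunked dimension.
std::size_t peak_bytes_per_chunk(std::size_t chunk_size, std::size_t bytes_per_index) {
    return chunk_size * bytes_per_index;
}
```

For a dimension of length 10 processed in chunks of 4, this yields the ranges [0, 4), [4, 8), and [8, 10), and the temporaries need room for 4 indices rather than 10.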