
More efficient rolling with large dask arrays #2908

Closed · Fixed by #2934

emaroon opened this issue Apr 19, 2019 · 3 comments
emaroon commented Apr 19, 2019

Code Sample

import xarray as xr
import dask.array as da

dsize = [62, 12, 100, 192, 288]  # ~33 GB of float64 in total
# chunked along dim_2 and dim_4; each chunk is ~165 MB
array1 = da.random.random(dsize, chunks=(dsize[0], dsize[1], 1, dsize[3], dsize[4] // 2))
array2 = xr.DataArray(array1)
rollingmean = array2.rolling(dim_1=3, center=True).mean()  # <-- this kills all workers

Problem description

I'm working on NCAR's Cheyenne system with a 36 GB netCDF file using dask_jobqueue.PBSCluster, trying to calculate a running mean along one dimension. Despite having plenty of memory reserved (roughly 400 GB), I can watch DataArray.rolling blow up the "bytes stored" metric in the dask dashboard until the job hangs and all the workers are killed.

The snippet above reproduces the issue with the same array size and chunk size I'm working with. This worker-killing behavior does not occur for arrays that are 100x smaller. I've found a fast way to calculate what I need without using rolling, but I thought I should bring this to your attention regardless.

In case it's relevant, here's how I'm setting up the dask cluster on Cheyenne:

from dask.distributed import Client
from dask_jobqueue import PBSCluster  # version 0.4.1

cluster = PBSCluster(cores=36, processes=9, memory='109GB', project=myproj,
                     resource_spec='select=1:ncpus=36:mem=109G',
                     queue='regular', walltime='02:00:00')
numnodes = 4
client = Client(cluster)
cluster.scale(numnodes * 9)  # 9 worker processes per node, 4 nodes

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1 (default, Dec 14 2018, 19:28:38)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.12.62-60.64.8-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.1
pandas: 0.24.1
numpy: 1.15.4
scipy: 1.2.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.1
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 1.1.5
distributed: 1.26.1
matplotlib: 3.0.2
cartopy: 0.17.0
seaborn: 0.9.0
setuptools: 40.6.3
pip: 18.1
conda: 4.6.13
pytest: None
IPython: 7.3.0
sphinx: None

shoyer (Member) commented Apr 19, 2019

I suspect that if you had bottleneck installed, this would actually be OK on memory usage.

Our current solution for cases where bottleneck is not installed involves creating a very large array in memory. It would be nice to allow for a less memory-hungry version as an option, even if it would be a little slower.

This is probably also an opportunity for documentation.
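For context, a minimal sketch of the memory-hungry fallback shoyer describes: without bottleneck, rolling reductions build a windowed array with an extra window dimension (the same thing DataArray.rolling(...).construct() exposes) and then reduce over it, so a window of 3 transiently needs roughly 3x the data. The shapes and the "window" dimension name below are illustrative:

import xarray as xr
import dask.array as da

arr = xr.DataArray(da.random.random((100, 200), chunks=(100, 50)))
# construct() adds a new length-3 "window" dimension holding the
# shifted copies that the rolling mean is taken over
windowed = arr.rolling(dim_1=3, center=True).construct('window')
print(windowed.shape)  # (100, 200, 3)
# reducing over that dimension reproduces rolling(...).mean(),
# up to NaN handling at the array edges
rollingmean = windowed.mean('window')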

@dcherian dcherian changed the title DataArray.rolling killing dask workers with large dataarray More efficient rolling with large dask arrays Apr 19, 2019
emaroon (Author) commented Apr 19, 2019

Thanks for the quick response! I installed bottleneck (v1.2.1) and that did the trick. The rolling computation is now lazy and nothing is stored in memory until the array is deliberately persisted. Having a hint in the rolling documentation about this would be great.
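For anyone else who hits this, a hedged sketch of the fix and the check, reusing the array from the original report (the persist() call at the end is only there to show where memory would actually be used):

# after installing bottleneck, e.g. with `conda install bottleneck`
import xarray as xr
import dask.array as da

array2 = xr.DataArray(da.random.random((62, 12, 100, 192, 288),
                                       chunks=(62, 12, 1, 192, 144)))
# with bottleneck available, this only builds a lazy dask graph
rollingmean = array2.rolling(dim_1=3, center=True).mean()
# memory is consumed only once the result is computed or persisted
result = rollingmean.persist()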

dcherian (Contributor) commented

@emaroon Are you up for sending in a pull request? :)
