
More efficient rolling with large dask arrays #2908

Closed · Fixed by #2934

emaroon opened this issue Apr 19, 2019 · 3 comments
emaroon commented Apr 19, 2019

Code Sample

import xarray as xr
import dask.array as da

dsize = [62, 12, 100, 192, 288]  # ~33 GB of float64 in total
# chunked along dim_2 and dim_4; each chunk is ~165 MB
array1 = da.random.random(dsize, chunks=(dsize[0], dsize[1], 1, dsize[3], dsize[4] // 2))
array2 = xr.DataArray(array1)
rollingmean = array2.rolling(dim_1=3, center=True).mean()  # <-- this kills all workers

Problem description

I'm working on NCAR's Cheyenne system with a 36 GB netCDF file using dask_jobqueue.PBSCluster, trying to calculate a running mean along one dimension. Despite having plenty of memory reserved (roughly 400 GB), I can watch DataArray.rolling blow up the "bytes stored" metric in the dask dashboard until the job hangs and all the workers are killed.

The snippet above reproduces the issue with the same array size and chunk size I'm working with. This worker-killing behavior does not occur for arrays that are 100x smaller. I've found a fast way to calculate what I need without using rolling, but I thought I should bring this to your attention regardless.

In case it's relevant, here's how I'm setting up the dask cluster on Cheyenne:

from dask.distributed import Client
from dask_jobqueue import PBSCluster  # version 0.4.1

cluster = PBSCluster(cores=36, processes=9, memory='109GB', project=myproj,
                     resource_spec='select=1:ncpus=36:mem=109G',
                     queue='regular', walltime='02:00:00')
numnodes = 4
client = Client(cluster)
cluster.scale(numnodes * 9)  # 9 worker processes per node, 4 nodes

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1 (default, Dec 14 2018, 19:28:38)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.12.62-60.64.8-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.1
pandas: 0.24.1
numpy: 1.15.4
scipy: 1.2.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.1
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 1.1.5
distributed: 1.26.1
matplotlib: 3.0.2
cartopy: 0.17.0
seaborn: 0.9.0
setuptools: 40.6.3
pip: 18.1
conda: 4.6.13
pytest: None
IPython: 7.3.0
sphinx: None

shoyer (Member) commented Apr 19, 2019

I suspect that if you had bottleneck installed, this would actually be OK on memory usage.

Our current solution for cases where bottleneck is not installed involves creating a very large array in memory. It would be nice to allow for a less memory-hungry version as an option, even if it would be a little slower.

This is probably also an opportunity for documentation.
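For context, a minimal sketch of the memory-hungry fallback shoyer describes: without bottleneck, rolling reductions build a windowed array with an extra window dimension (the same thing DataArray.rolling(...).construct() exposes) and then reduce over it, so a window of 3 transiently needs roughly 3x the data. The shapes and the "window" dimension name below are illustrative:

import xarray as xr
import dask.array as da

arr = xr.DataArray(da.random.random((100, 200), chunks=(100, 50)))
# construct() adds a new length-3 "window" dimension holding the
# shifted copies that the rolling mean is taken over
windowed = arr.rolling(dim_1=3, center=True).construct('window')
print(windowed.shape)  # (100, 200, 3)
# reducing over that dimension reproduces rolling(...).mean(),
# up to NaN handling at the array edges
rollingmean = windowed.mean('window')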

@dcherian dcherian changed the title DataArray.rolling killing dask workers with large dataarray More efficient rolling with large dask arrays Apr 19, 2019
emaroon (Author) commented Apr 19, 2019

Thanks for the quick response! I installed bottleneck (v1.2.1) and that did the trick. The rolling computation is now lazy and nothing is stored in memory until the array is deliberately persisted. Having a hint in the rolling documentation about this would be great.
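For anyone else who hits this, a hedged sketch of the fix and the check, reusing the array from the original report (the persist() call at the end is only there to show where memory would actually be used):

# after installing bottleneck, e.g. with `conda install bottleneck`
import xarray as xr
import dask.array as da

array2 = xr.DataArray(da.random.random((62, 12, 100, 192, 288),
                                       chunks=(62, 12, 1, 192, 144)))
# with bottleneck available, this only builds a lazy dask graph
rollingmean = array2.rolling(dim_1=3, center=True).mean()
# memory is consumed only once the result is computed or persisted
result = rollingmean.persist()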

dcherian (Contributor) commented

@emaroon Are you up for sending in a pull request? :)
