Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance problem when doing computation between two arrays with discontinuous indexes #3755

Open
vincentchabot opened this issue Feb 6, 2020 · 0 comments

Comments

@vincentchabot
Copy link

MCVE Code Sample

import xarray as xr 
import numpy as np 

# Creating array
ds = xr.Dataset()
ds["longitude"] = np.arange(0.1,2000.1,1)
ds["latitude"] = range(3000)
ds["step"] = range(50)
ds["field"] = (("step","latitude","longitude"),np.random.randn(50,3000,2000))
ds.to_netcdf("big_array.nc")

# Create another array 
ds = xr.Dataset()
ds["longitude"] = np.arange(500.1,600.1,1) # Coordinate are a continuous subset of the first array
ds["latitude"] = np.arange(510,660)
ds["id"] = range(50)
ds["field"] = (("longitude","latitude","id"),np.random.randn(100,150,50))
ds.to_netcdf("slicing.nc")

# Create another array with "discontinuity" in longitude dimension 
ds = xr.Dataset()
ds["longitude"] = list(np.arange(500.1,598.1,1)) +[622.1,640.1]
ds["latitude"] = range(510,660)
ds["id"] = range(10)
ds["mask"] = (("longitude","latitude","id"),np.random.randn(100,150,10))
ds.to_netcdf("no_slicing.nc")

# Load the Three arrays 

da = xr.open_dataset("big_array.nc")
db = xr.open_dataset("slicing.nc").isel(id=0)
dc = xr.open_dataarray("no_slicing.nc").isel(id=0)

%timeit da*db
# 32.3 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit da*dc
#2.13 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def slicing_operation(dc,da):
    """
    Slicing when knowing that dc is a subpart of da
    """
    import operator
    min_lat = np.max([dc.latitude.values.min(),da.latitude.values.min()])
    max_lat = np.min([dc.latitude.values.max(),da.latitude.values.max()])
    index_lat_field = operator.and_(da.latitude >= min_lat,da.latitude <= max_lat)
    min_lon = np.max([dc.longitude.values.min(),da.longitude.values.min()])
    max_lon = np.min([dc.longitude.values.max(),da.longitude.values.max()])
    index_lon_field = operator.and_(da.longitude >= min_lon,da.longitude <= max_lon)
    
    # Extending dc such that it covers all longitude of da 
    # Creating an empty array. 
    dempty = xr.Dataset()
    dempty["longitude"] = da.longitude.isel(longitude=index_lon_field).values
    dempty["latitude"] = da.latitude.isel(latitude=index_lat_field).values
    dc_ext = dc.broadcast_like(dempty)
    # Performing multiplication and align 
    res,_ = xr.align((dc_ext *da),dc)
    return res

%timeit slicing_operation(dc,da)
#43.9 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#Checking if we get the same results
res_0 = dc*da
res = slicing_operation(dc,da)
print(abs(res_0 - res).max())

Problem Description

A performance problem occured when performing operation between da and dc.
Computing da *db takes ~32 ms
While db and dc have the same shape, opeartion dc * da takes ~2.13s (so x 60 in cpu times).
This seems linked to the fact tha dc includes discontinuous indexes of da.

Indeed, when extending dc (such that its longitude are continuous when compared at da) the computation time take ~44 ms.
For me it seems that when doing da * dc da dataset is fully loaded in memory first, while only the part involved in computation is loaded when doing da *db.
Indeed when doing :

%time da.load()

we get the following result:

CPU times: user 0 ns, sys: 1.88 s, total: 1.88 s

Wall time: 1.89 s

On top of that, da *db scales with the number of id selected while it is not the case for da *dc .

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None

python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48)

[GCC 7.3.0]

python-bits: 64

OS: Linux

OS-release: 3.10.0-957.27.2.el7.x86_64

machine: x86_64

processor: x86_64

byteorder: little

LC_ALL: None

LANG: en_US.UTF-8

LOCALE: en_US.UTF-8

libhdf5: 1.10.5

libnetcdf: 4.7.3

xarray: 0.14.1

pandas: 0.25.3

numpy: 1.17.4

scipy: 1.4.1

netCDF4: 1.5.3

pydap: None

h5netcdf: None

h5py: 2.10.0

Nio: None

zarr: None

cftime: 1.0.4.2

nc_time_axis: None

PseudoNetCDF: None

rasterio: None

cfgrib: 0.9.7.6

iris: None

bottleneck: 1.3.1

dask: 2.9.1

distributed: 2.9.1

matplotlib: 3.1.2

cartopy: None

seaborn: 0.9.0

numbagg: None

setuptools: 44.0.0.post20200106

pip: 19.3.1

conda: None

pytest: 5.3.2

IPython: 7.11.1

sphinx: 2.3.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants