
Huge memory consumption in batch jobs on 3D variables #16

Open
hmkhatri opened this issue Feb 21, 2022 · 5 comments
@hmkhatri (Owner)

The drift calculation code works fine in JASMIN notebooks, but it runs into memory issues in LOTUS batch jobs. For some reason, the code starts to consume far more memory than it should. The same code works fine for 2D variables, and there is even a performance improvement with dask-mpi. This needs investigation.

@hmkhatri (Owner) commented Mar 6, 2022

Use the following command to run on multiple cores with Dask (no need to pass the -np flag; the SLURM allocation takes care of that):
mpirun python file.py

In the Python script, add the following lines at the start, so that Dask is aware of the multiple cores:

from dask_mpi import initialize
initialize()  # rank 0 runs the scheduler, rank 1 runs this script, remaining ranks become workers

from dask.distributed import Client
client = Client()  # connects to the scheduler started by initialize()

Memory issues still persist. Not sure what is going wrong.
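For reference, a minimal self-contained sketch of what such a script could look like is below; the file name "file.nc", the variable name "thetao", and the chunk sizes are placeholders, and the reduction only stands in for the actual drift calculation.

# sketch of a script launched with: mpirun python file.py
from dask_mpi import initialize
initialize()  # must be called before creating the Client

from dask.distributed import Client
import xarray as xr

client = Client()  # attaches to the scheduler started by initialize()

# hypothetical input file and chunking; open lazily so the workers do the reads
ds = xr.open_dataset("file.nc", chunks={"time": 1})

# hypothetical reduction standing in for the real drift calculation
result = ds["thetao"].mean(dim="time").compute()
print(result)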

@hmkhatri (Owner) commented Mar 7, 2022

The memory issue appears mainly with mpirun and Dask on multiple cores. Memory usage keeps increasing within the time loop. Ideally, Dask would clear old variables from memory before starting new computations, but it seems that variables are not being released from the Dask workers.

The code seems to work fine in a single-core run (srun python file.py). This needs further investigation.

Also see related issues

github.com/pydata/xarray/issues/2186
github.com/dask/dask/issues/3530
github.com/dask/distributed/issues/3103
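One workaround that might be worth trying is to explicitly drop references and trigger a garbage-collection pass on the workers at the end of each iteration of the time loop. The sketch below only illustrates the idea: the per-step dask.array computation, the number of steps, and the array sizes are hypothetical stand-ins for the real drift calculation.

import gc

import dask.array as da
from dask.distributed import Client

client = Client()  # or the Client created after dask_mpi.initialize()

n_steps = 10  # hypothetical number of time steps

for t in range(n_steps):
    # hypothetical per-step computation standing in for the real drift calculation
    step = da.random.random((1000, 1000), chunks=(250, 250))
    result = step.mean().compute()
    print(t, result)

    # drop local references and ask every worker to run a garbage-collection pass,
    # so intermediate chunks can be released before the next iteration
    del step, result
    client.run(gc.collect)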

@hmkhatri (Owner) commented Mar 13, 2022

There is no clear solution yet.

Nevertheless, the following seems to help a bit. Using a context manager for reading the netCDF files, with the preprocess keyword to drop unneeded variables, could help by auto-closing data files that are no longer required.

import xarray as xr

def select_subset(d1):
    # drop variables that are not needed for the calculation
    d1 = d1.drop_vars([drop_var1, drop_var2])
    return d1

ds_list = []
for r in range(0, 10):  # loop over ensemble members
    # preprocess and parallel are open_mfdataset keywords; the context manager
    # ensures the underlying files are closed once the data has been read
    with xr.open_mfdataset("file.nc", preprocess=select_subset,
                           chunks={'lev': 1}, parallel=True) as ds1:
        ds_list.append(ds1)

ds = xr.concat(ds_list, dim='r')

Also see issue 5322 on dask/distributed. There is some information there on file locks on the workers, which could be related.

@hmkhatri (Owner) commented Mar 9, 2023

The memory blow-up issues could be related to Dask itself. Dask released a major update in Nov 2022 (https://www.coiled.io/blog/reducing-dask-memory-usage), and the dask-mpi implementation has improved since then. More testing is required to make sure it works well for all data.
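If I understand the linked post correctly, the main change is scheduler-side task queuing controlled by the worker-saturation setting. A sketch of enabling it explicitly in a dask-mpi script (assuming dask/distributed >= 2022.11.0 and that this is the relevant knob) would be:

import dask

# Assumption: the scheduler-side queuing described in the linked Coiled post is
# controlled by this config key (dask/distributed >= 2022.11.0). Because every
# MPI rank runs this script, setting it before initialize() should also reach
# the rank that becomes the scheduler.
dask.config.set({"distributed.scheduler.worker-saturation": 1.1})

from dask_mpi import initialize
initialize()

from dask.distributed import Client
client = Client()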

@hmkhatri (Owner)

Observation: Rechunking within the code leads to memory blow-up

Specifying chunks while reading data (as below) works fine
ds1 = xr.open_dataset(file, chunks={'time':1})

If rechunking is performed within the code (e.g. as below), then dask-mpi fails with an "out_of_memory" error
ds1 = ds1.chunk({"time": 1, "lev":10, "x":-1, "y":-1})

Cause is not clear. Needs investigation.
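If the target chunking is known up front, an alternative worth trying is to request it directly at read time rather than rechunking later; a sketch reusing the dimension names from the examples above:

import xarray as xr

# "file" is the same netCDF path used above; request the full target chunking
# when opening the file instead of calling .chunk() on an already-chunked
# dataset later in the code
ds1 = xr.open_dataset(file, chunks={"time": 1, "lev": 10, "x": -1, "y": -1})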
