
Huge memory consumption in batch jobs on 3D variables #16

Open
hmkhatri opened this issue Feb 21, 2022 · 5 comments
@hmkhatri (Owner)

The drift calculation code works fine in JASMIN notebooks, but it runs into memory issues in LOTUS batch jobs. For some reason, the code starts to consume far more memory than it should. The same code works fine for 2D variables, and there is even a performance improvement with dask-mpi. This needs investigation.

@hmkhatri (Owner) commented Mar 6, 2022

Use the following command to run on multiple cores with Dask (no need to pass the -np flag; the SLURM allocation takes care of that):
mpirun python file.py

In the Python script, add the following lines at the start, so that Dask is aware of the multiple cores:

from dask_mpi import initialize
initialize()  # rank 0 runs the scheduler, rank 1 runs this script, remaining ranks become workers

from dask.distributed import Client
client = Client()  # connects to the scheduler started by initialize()

Memory issues still persist. Not sure what is going wrong.
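For reference, a minimal self-contained sketch of what such a script could look like is below; the file name "file.nc", the variable name "thetao", and the chunk sizes are placeholders, and the reduction only stands in for the actual drift calculation.

# sketch of a script launched with: mpirun python file.py
from dask_mpi import initialize
initialize()  # must be called before creating the Client

from dask.distributed import Client
import xarray as xr

client = Client()  # attaches to the scheduler started by initialize()

# hypothetical input file and chunking; open lazily so the workers do the reads
ds = xr.open_dataset("file.nc", chunks={"time": 1})

# hypothetical reduction standing in for the real drift calculation
result = ds["thetao"].mean(dim="time").compute()
print(result)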

@hmkhatri (Owner) commented Mar 7, 2022

The memory issue appears mainly with mpirun and Dask on multiple cores. Memory usage keeps increasing within the time loop. Ideally, Dask would clear old variables from memory before starting new computations, but it seems that variables are not being released from the Dask workers.

The code seems to work fine in a single-core run (srun python file.py). This needs further investigation.

Also see related issues

github.com/pydata/xarray/issues/2186
github.com/dask/dask/issues/3530
github.com/dask/distributed/issues/3103
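One workaround that might be worth trying is to explicitly drop references and trigger a garbage-collection pass on the workers at the end of each iteration of the time loop. The sketch below only illustrates the idea: the per-step dask.array computation, the number of steps, and the array sizes are hypothetical stand-ins for the real drift calculation.

import gc

import dask.array as da
from dask.distributed import Client

client = Client()  # or the Client created after dask_mpi.initialize()

n_steps = 10  # hypothetical number of time steps

for t in range(n_steps):
    # hypothetical per-step computation standing in for the real drift calculation
    step = da.random.random((1000, 1000), chunks=(250, 250))
    result = step.mean().compute()
    print(t, result)

    # drop local references and ask every worker to run a garbage-collection pass,
    # so intermediate chunks can be released before the next iteration
    del step, result
    client.run(gc.collect)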

@hmkhatri (Owner) commented Mar 13, 2022

There is no clear solution yet.

Nevertheless, the following seems to help a bit. Using a context manager for reading the netCDF files, with the preprocess keyword to drop unneeded variables, could help by auto-closing data files that are no longer required.

import xarray as xr

def select_subset(d1):
    # drop variables that are not needed for the calculation
    d1 = d1.drop_vars([drop_var1, drop_var2])
    return d1

ds_list = []
for r in range(0, 10):  # loop over ensemble members
    # preprocess and parallel are open_mfdataset keywords; the context manager
    # ensures the underlying files are closed once the data has been read
    with xr.open_mfdataset("file.nc", preprocess=select_subset,
                           chunks={'lev': 1}, parallel=True) as ds1:
        ds_list.append(ds1)

ds = xr.concat(ds_list, dim='r')

Also see issue 5322 on dask/distributed. There is some information there on file locks on the workers, which could be related.

@hmkhatri (Owner) commented Mar 9, 2023

The memory blow-up issues could be related to Dask itself. Dask released a major update in Nov 2022 (https://www.coiled.io/blog/reducing-dask-memory-usage), and the dask-mpi implementation has improved since then. More testing is required to make sure it works well for all data.
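If I understand the linked post correctly, the main change is scheduler-side task queuing controlled by the worker-saturation setting. A sketch of enabling it explicitly in a dask-mpi script (assuming dask/distributed >= 2022.11.0 and that this is the relevant knob) would be:

import dask

# Assumption: the scheduler-side queuing described in the linked Coiled post is
# controlled by this config key (dask/distributed >= 2022.11.0). Because every
# MPI rank runs this script, setting it before initialize() should also reach
# the rank that becomes the scheduler.
dask.config.set({"distributed.scheduler.worker-saturation": 1.1})

from dask_mpi import initialize
initialize()

from dask.distributed import Client
client = Client()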

@hmkhatri (Owner)

Observation: Rechunking within the code leads to memory blow-up

Specifying chunks while reading data (as below) works fine
ds1 = xr.open_dataset(file, chunks={'time':1})

If rechunking is performed within the code (e.g. as below), then dask-mpi fails with an "out_of_memory" error
ds1 = ds1.chunk({"time": 1, "lev":10, "x":-1, "y":-1})

Cause is not clear. Needs investigation.
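If the target chunking is known up front, an alternative worth trying is to request it directly at read time rather than rechunking later; a sketch reusing the dimension names from the examples above:

import xarray as xr

# "file" is the same netCDF path used above; request the full target chunking
# when opening the file instead of calling .chunk() on an already-chunked
# dataset later in the code
ds1 = xr.open_dataset(file, chunks={"time": 1, "lev": 10, "x": -1, "y": -1})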
