
high memory usage when appending uds to partitions list #287

Closed
veenstrajelmer opened this issue Aug 20, 2024 · 3 comments


veenstrajelmer commented Aug 20, 2024

Running the following script, `memory_usage.py`, with memory_profiler via `mprof run python memory_usage.py` and `mprof plot`:

```python

import os
import glob
import xugrid as xu
import xarray as xr
import datetime as dt
from time import sleep

def open_part_ds(file_nc_list):
    print(f'>> xu.open_dataset() with {len(file_nc_list)} partition(s): ', end='')
    dtstart = dt.datetime.now()
    partitions = []
    for iF, file_nc_one in enumerate(file_nc_list):
        print(iF+1, end=' ')
        # open one partition lazily and wrap it in a UgridDataset
        ds_one = xr.open_mfdataset(file_nc_one, chunks="auto")
        uds_one = xu.core.wrap.UgridDataset(ds_one)
        partitions.append(uds_one)
    print(': ', end='')
    print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')
    return partitions

dir_model = r"p:\11210284-011-nose-c-cycling\runs_fine_grid\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\DFM_OUTPUT_DCSM-FM_0_5nm_waq"
file_nc_pat = os.path.join(dir_model, "DCSM-FM_0_5nm_waq_0*_map.nc")
file_nc_list_all = glob.glob(file_nc_pat)
file_nc_list = file_nc_list_all[:5]

partitions = open_part_ds(file_nc_list)
sleep(2)
```

This results in the following memory usage:

*[mprof memory usage plot]*

However, when commenting out `partitions.append(uds_one)`, memory usage is much lower and we see garbage collection in action:

*[mprof memory usage plot]*

The accumulating memory consumption upon appending is inconvenient, since we want to build a list of partitions for `xu.merge_partitions()`. Calling `gc.collect()` after `xr.open_mfdataset()` (or elsewhere) makes no difference.
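For illustration, a minimal sketch of one placement of `gc.collect()` that was tried (assuming the same imports and `file_nc_list` as in the script above); other placements behave the same:

```python
import gc

import xarray as xr
import xugrid as xu

partitions = []
for file_nc_one in file_nc_list:  # file_nc_list as defined in the script above
    ds_one = xr.open_mfdataset(file_nc_one, chunks="auto")
    uds_one = xu.core.wrap.UgridDataset(ds_one)
    partitions.append(uds_one)
    gc.collect()  # no measurable effect on the memory profile
```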

Might be related to:


veenstrajelmer commented Aug 21, 2024

What is very interesting is that using a `with` statement resolves all of this. This time including merging:

```python

import os
import glob
import xugrid as xu
import xarray as xr
import datetime as dt
from time import sleep

def open_part_ds(file_nc_list, withwith):
    print(f'>> xu.open_dataset() with {len(file_nc_list)} partition(s): ',end='')
    dtstart = dt.datetime.now()
    partitions = []
    for iF, file_nc_one in enumerate(file_nc_list):
        print(iF+1,end=' ')
        if withwith:
            # the with-statement closes ds_one again after wrapping it
            with xr.open_mfdataset(file_nc_one, chunks="auto") as ds_one:
                uds_one = xu.core.wrap.UgridDataset(ds_one)
        else:
            ds_one = xr.open_mfdataset(file_nc_one, chunks="auto")
            uds_one = xu.core.wrap.UgridDataset(ds_one)
            # ds_one.close()
            # uds_one.close()
        partitions.append(uds_one)
    print(': ',end='')
    print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')
    
    print('>> xu.merge_partitions(): ',end='')
    dtstart = dt.datetime.now()
    uds = xu.merge_partitions(partitions)
    print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')
    return uds

dir_model = r"p:\11210284-011-nose-c-cycling\runs_fine_grid\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\DFM_OUTPUT_DCSM-FM_0_5nm_waq"
file_nc_pat = os.path.join(dir_model, "DCSM-FM_0_5nm_waq_0*_map.nc")
file_nc_list_all = glob.glob(file_nc_pat)
file_nc_list = file_nc_list_all[:5]

uds = open_part_ds(file_nc_list, withwith=False)
sleep(2)
```

`withwith=False`:

*[mprof memory usage plot]*

`withwith=True`:

*[mprof memory usage plot]*

Or `withwith=False` with `ds_one.close()` (`uds_one.close()` does not do the trick):

*[mprof memory usage plot]*

From this we can conclude that it is wise to close the original xarray dataset once it is no longer used; the time and memory consumption of the merge itself is unaffected. I will at least pick this up in Deltares/dfm_tools#968, but it might also be good to add this to the xugrid documentation. Building it into `xu.open_dataset()` has no added benefit, since users might use the returned uds partition directly (e.g. for removing ghost cells), and in that case the memory consumption comes back.
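For reference, a minimal sketch of the close-after-wrapping pattern described above (assuming the same `file_nc_list` glob as in the scripts above):

```python
import xarray as xr
import xugrid as xu

partitions = []
for file_nc_one in file_nc_list:  # file_nc_list as defined in the scripts above
    with xr.open_mfdataset(file_nc_one, chunks="auto") as ds_one:
        # wrap while the file is open; the with-block closes ds_one afterwards
        partitions.append(xu.core.wrap.UgridDataset(ds_one))
uds = xu.merge_partitions(partitions)
```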


veenstrajelmer commented Aug 21, 2024

If the user performs another action on the merged dataset (like plotting a single timestep), the memory usage increases again to the level we saw without `ds_one.close()`. This is documented in Deltares/dfm_tools#968 and a clean version in Deltares/dfm_tools#484. Therefore, closing the datasets seems not useful after all. Furthermore, it is clear that `engine="h5netcdf"` consumes far less memory (40 MB instead of 110 MB per partition), but `xr.open_dataset()` turned out to be much slower with that engine for datasets with many variables like in this example. This might be fixed by h5netcdf/h5netcdf#195.
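For completeness, selecting the engine is a single keyword argument to xarray (a sketch; the memory numbers above are specific to this dataset):

```python
import xarray as xr

# with engine="h5netcdf" each partition consumed roughly 40 MB instead of 110 MB here,
# at the cost of much slower opening for datasets with many variables
ds_one = xr.open_mfdataset(file_nc_one, chunks="auto", engine="h5netcdf")
```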

veenstrajelmer commented

Since this is not an issue with xugrid, this issue can be closed.

veenstrajelmer closed this as not planned on Aug 30, 2024