
high memory usage when appending uds to partitions list #287

Closed
veenstrajelmer opened this issue Aug 20, 2024 · 3 comments


veenstrajelmer commented Aug 20, 2024

Running the following script, `memory_usage.py`, with memory_profiler via `mprof run python memory_usage.py` and `mprof plot`:

```python

import os
import glob
import xugrid as xu
import xarray as xr
import datetime as dt
from time import sleep

def open_part_ds(file_nc_list):
    print(f'>> xu.open_dataset() with {len(file_nc_list)} partition(s): ', end='')
    dtstart = dt.datetime.now()
    partitions = []
    for iF, file_nc_one in enumerate(file_nc_list):
        print(iF+1, end=' ')
        # open one partition lazily and wrap it in a UgridDataset
        ds_one = xr.open_mfdataset(file_nc_one, chunks="auto")
        uds_one = xu.core.wrap.UgridDataset(ds_one)
        partitions.append(uds_one)
    print(': ', end='')
    print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')
    return partitions

dir_model = r"p:\11210284-011-nose-c-cycling\runs_fine_grid\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\DFM_OUTPUT_DCSM-FM_0_5nm_waq"
file_nc_pat = os.path.join(dir_model, "DCSM-FM_0_5nm_waq_0*_map.nc")
file_nc_list_all = glob.glob(file_nc_pat)
file_nc_list = file_nc_list_all[:5]

partitions = open_part_ds(file_nc_list)
sleep(2)
```

This results in the following memory usage:

*[mprof memory usage plot]*

However, when commenting out `partitions.append(uds_one)`, memory usage is much lower and we see garbage collection in action:

*[mprof memory usage plot]*

The accumulating memory consumption upon appending is inconvenient, since we want to build a list of partitions for `xu.merge_partitions()`. Calling `gc.collect()` after `xr.open_mfdataset()` (or elsewhere) makes no difference.
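For illustration, a minimal sketch of one placement of `gc.collect()` that was tried (assuming the same imports and `file_nc_list` as in the script above); other placements behave the same:

```python
import gc

import xarray as xr
import xugrid as xu

partitions = []
for file_nc_one in file_nc_list:  # file_nc_list as defined in the script above
    ds_one = xr.open_mfdataset(file_nc_one, chunks="auto")
    uds_one = xu.core.wrap.UgridDataset(ds_one)
    partitions.append(uds_one)
    gc.collect()  # no measurable effect on the memory profile
```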

Might be related to:


veenstrajelmer commented Aug 21, 2024

What is very interesting is that using a `with` statement resolves all of this. This time including merging:

```python

import os
import glob
import xugrid as xu
import xarray as xr
import datetime as dt
from time import sleep

def open_part_ds(file_nc_list, withwith):
    print(f'>> xu.open_dataset() with {len(file_nc_list)} partition(s): ',end='')
    dtstart = dt.datetime.now()
    partitions = []
    for iF, file_nc_one in enumerate(file_nc_list):
        print(iF+1,end=' ')
        if withwith:
            # the with-statement closes ds_one again after wrapping it
            with xr.open_mfdataset(file_nc_one, chunks="auto") as ds_one:
                uds_one = xu.core.wrap.UgridDataset(ds_one)
        else:
            ds_one = xr.open_mfdataset(file_nc_one, chunks="auto")
            uds_one = xu.core.wrap.UgridDataset(ds_one)
            # ds_one.close()
            # uds_one.close()
        partitions.append(uds_one)
    print(': ',end='')
    print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')
    
    print('>> xu.merge_partitions(): ',end='')
    dtstart = dt.datetime.now()
    uds = xu.merge_partitions(partitions)
    print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')
    return uds

dir_model = r"p:\11210284-011-nose-c-cycling\runs_fine_grid\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\DFM_OUTPUT_DCSM-FM_0_5nm_waq"
file_nc_pat = os.path.join(dir_model, "DCSM-FM_0_5nm_waq_0*_map.nc")
file_nc_list_all = glob.glob(file_nc_pat)
file_nc_list = file_nc_list_all[:5]

uds = open_part_ds(file_nc_list, withwith=False)
sleep(2)
```

`withwith=False`:

*[mprof memory usage plot]*

`withwith=True`:

*[mprof memory usage plot]*

Or `withwith=False` with `ds_one.close()` (`uds_one.close()` does not do the trick):

*[mprof memory usage plot]*

From this we can conclude that it is wise to close the original xarray dataset once it is no longer used; the time and memory consumption of the merge itself is unaffected. I will at least pick this up in Deltares/dfm_tools#968, but it might also be good to add this to the xugrid documentation. Building it into `xu.open_dataset()` has no added benefit, since users might use the returned uds partition directly (e.g. for removing ghost cells), and in that case the memory consumption comes back.
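For reference, a minimal sketch of the close-after-wrapping pattern described above (assuming the same `file_nc_list` glob as in the scripts above):

```python
import xarray as xr
import xugrid as xu

partitions = []
for file_nc_one in file_nc_list:  # file_nc_list as defined in the scripts above
    with xr.open_mfdataset(file_nc_one, chunks="auto") as ds_one:
        # wrap while the file is open; the with-block closes ds_one afterwards
        partitions.append(xu.core.wrap.UgridDataset(ds_one))
uds = xu.merge_partitions(partitions)
```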


veenstrajelmer commented Aug 21, 2024

If the user performs another action on the merged dataset (like plotting a single timestep), the memory usage increases again to the level we saw without `ds_one.close()`. This is documented in Deltares/dfm_tools#968 and a clean version in Deltares/dfm_tools#484. Therefore, closing the datasets seems not useful after all. Furthermore, it is clear that `engine="h5netcdf"` consumes far less memory (40 MB instead of 110 MB per partition), but `xr.open_dataset()` turned out to be much slower with that engine for datasets with many variables like in this example. This might be fixed by h5netcdf/h5netcdf#195.
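For completeness, selecting the engine is a single keyword argument to xarray (a sketch; the memory numbers above are specific to this dataset):

```python
import xarray as xr

# with engine="h5netcdf" each partition consumed roughly 40 MB instead of 110 MB here,
# at the cost of much slower opening for datasets with many variables
ds_one = xr.open_mfdataset(file_nc_one, chunks="auto", engine="h5netcdf")
```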

veenstrajelmer commented

Since this is not an issue with xugrid, this issue can be closed.

veenstrajelmer closed this as not planned on Aug 30, 2024