Simplify data drivers #720

savente93 · 2024-01-10T13:12:23Z

Kind of request

Currently, DataAdapters are responsible for representing different data sources in the DataCatalog, reading in the data, and transforming the data to a uniform in-memory representation. This gives the class many responsibilities and makes it hard for plugins to modify or extend.

Enhancement Description

We propose that a Driver should be responsible for reading the data and creating a memory representation, while the Adapter should do generic transformations and filtering/slicing. A DataSource should represent items in the DataCatalog, which can check at read time whether all the required fields are present.
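
A minimal sketch of this split of responsibilities, assuming toy in-memory data. All class names and signatures below (`InMemoryDriver`, `RenameAdapter`, the `read`/`transform` methods) are hypothetical illustrations, not the HydroMT API:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Protocol

Record = Dict[str, Any]


class Driver(Protocol):
    """Reads data from one or more URIs into an in-memory representation."""

    def read(self, uris: List[str]) -> List[Record]: ...


class Adapter(Protocol):
    """Applies generic transformations such as renaming or slicing."""

    def transform(self, data: List[Record]) -> List[Record]: ...


class InMemoryDriver:
    """Toy driver (hypothetical): looks records up in a dict instead of on disk."""

    def __init__(self, store: Dict[str, List[Record]]) -> None:
        self.store = store

    def read(self, uris: List[str]) -> List[Record]:
        return [dict(row) for uri in uris for row in self.store[uri]]


class RenameAdapter:
    """Toy adapter (hypothetical): renames keys, like `rename` in a catalog entry."""

    def __init__(self, rename: Dict[str, str]) -> None:
        self.rename = rename

    def transform(self, data: List[Record]) -> List[Record]:
        return [{self.rename.get(k, k): v for k, v in row.items()} for row in data]


@dataclass
class DataSource:
    """Represents one catalog item; required fields are checked at read time."""

    name: str
    uri: Optional[str] = None
    driver: Optional[Driver] = None
    adapter: Optional[Adapter] = None

    def read(self) -> List[Record]:
        missing = [f for f in ("uri", "driver", "adapter") if getattr(self, f) is None]
        if missing:
            raise ValueError(f"{self.name}: missing required fields {missing}")
        return self.adapter.transform(self.driver.read([self.uri]))


store = {"mem://demo": [{"pr": 1.0}, {"pr": 2.5}]}
source = DataSource(
    "demo", "mem://demo", InMemoryDriver(store), RenameAdapter({"pr": "precip"})
)
print(source.read())
```

Because the driver and adapter are separate objects, each can be tested or replaced by a plugin without touching the other.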

Use case

This should make testing and maintenance easier, and make customization by plugins more flexible.

Additional Context

No response

Jaapel · 2024-01-12T15:05:46Z

Posting #432 here as a reference for related discussions.

Jaapel · 2024-01-16T10:51:54Z

Look at this DataCatalog entry:

gtsm_codec_reanalysis_{freq}_v1:
  crs: 4326
  data_type: GeoDataset
  driver: netcdf
  kwargs:
    chunks:
      stations: 10
      time: -1
  meta:
    category: ocean
    paper_doi: 10.3389/fmars.2020.00263
    paper_ref: Muis at al (2020)
    source_license: https://cds.climate.copernicus.eu/api/v2/terms/static/licence-to-use-copernicus-products.pdf
    source_url: https://doi.org/10.24381/cds.8c59054f
    source_version: v1
  path: p:/11205028-c3s_435/01_data/01_Timeseries/timeseries2/{variable}/reanalysis_{variable}_{freq}_{year}_{month:02d}_v1.nc
  placeholders:
    freq: [10min, hourly, dailymax]
  rename:
    station_x_coordinate: lon
    station_y_coordinate: lat
    stations: index

There is a placeholder in the entry title, which can easily be expanded using the placeholders entry in the YAML document. But what about the path entry? There freq comes back, but there are also year, month and variable. For RasterDataset there is also zoom_level.
Is this generic behavior for certain datasets, or can we just handle it in the more generic DataSource classes (e.g. RasterDataSource)? I want to split _resolve_paths between what the driver should be responsible for (remote data access -> filesystem.glob) and what is generic across different data sources (handling naming conventions). @DirkEilander @hboisgon do you think these naming conventions are truly generic? So far the logic seems to be to fill in and capture these "known" placeholders, and to place * for any placeholder that is not recognized.
Did I miss any behavior?
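
The logic described in this comment (fill in known keys, place * for anything else) could be sketched roughly as follows. This is an illustration under assumptions, not the actual `_resolve_paths` code: the function name `to_glob_pattern` and the shortened path are made up, and "waterlevel" is a hypothetical variable name.

```python
import string

# Keywords named in this thread; treated here as the "known" set.
KNOWN_KEYS = ("year", "month", "variable", "zoom_level")


def to_glob_pattern(path: str, **values) -> str:
    """Fill known keys that have a value; replace every other key with '*'."""
    parts = []
    for literal, key, spec, _conversion in string.Formatter().parse(path):
        parts.append(literal)
        if key is None:  # trailing literal text without a replacement field
            continue
        if key in KNOWN_KEYS and key in values:
            # Honour format specs such as {month:02d}.
            parts.append(format(values[key], spec))
        else:
            parts.append("*")
    return "".join(parts)


path = "data/{variable}/reanalysis_{variable}_{year}_{month:02d}_v1.nc"
print(to_glob_pattern(path, variable="waterlevel", year=2010))
```

The resulting pattern could then be handed to something like `filesystem.glob`, which is the part that would stay with the driver.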

hboisgon · 2024-01-17T02:42:42Z

Let me try to answer:
placeholders is really different from the others because it helps to define multiple data sources (from the same dataset) that have exactly the same reading attributes apart from the path. The best example where we use this is the CMIP6 data, where a single data catalog entry defines 23*2 data sources:

cmip6_{model}_historical_{member}_{timestep}:
  crs: 4326
  data_type: RasterDataset
  driver: zarr
  filesystem: gcs
  kwargs:
    drop_variables: [time_bnds, lat_bnds, lon_bnds, bnds]
    decode_times: true
    preprocess: harmonise_dims
    consolidated: true
  meta:
    category: climate
    paper_doi: 10.1175/BAMS-D-11-00094.1
    paper_ref: Taylor et al. 2012
    source_license: CC BY 4.0
    source_url: https://console.cloud.google.com/marketplace/details/noaa-public/cmip6?_ga=2.136097265.-1784288694.1541379221&pli=1
    source_version: 1.3.1
  placeholders:
    model: [IPSL/IPSL-CM6A-LR, SNU/SAM0-UNICON, NCAR/CESM2, NCAR/CESM2-WACCM, INM/INM-CM4-8, INM/INM-CM5-0, NOAA-GFDL/GFDL-ESM4, NCC/NorESM2-LM, NIMS-KMA/KACE-1-0-G,
      CAS/FGOALS-f3-L, CSIRO-ARCCSS/ACCESS-CM2, NCC/NorESM2-MM, CSIRO/ACCESS-ESM1-5, NCAR/CESM2-WACCM-FV2, NCAR/CESM2-FV2, CMCC/CMCC-CM2-SR5, AS-RCEC/TaiESM1,
      NCC/NorCPM1, IPSL/IPSL-CM5A2-INCA, CMCC/CMCC-CM2-HR4, CMCC/CMCC-ESM2, IPSL/IPSL-CM6A-LR-INCA, E3SM-Project/E3SM-1-0]
    member: [r1i1p1f1]
    timestep: [day, Amon]
  path: gs://cmip6/CMIP6/CMIP/{model}/historical/{member}/{timestep}/{variable}/*/*
  rename:
    pr: precip
    tas: temp
    rsds: kin
    psl: press_msl
  unit_add:
    temp: -273.15
  unit_mult:
    precip: 86400
    press_msl: 0.01

So placeholders is really something that holds for all of the DataSource types, and all placeholder keywords should be findable in the path.
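
As a rough sketch of how one catalog entry with placeholders could expand into several data sources: the function `expand_placeholders` and the shortened path below are assumptions for illustration, not the HydroMT parser. Only the placeholder keys are substituted, so path keywords such as `{variable}` or `{month:02d}` survive untouched.

```python
from itertools import product


def expand_placeholders(name: str, entry: dict) -> dict:
    """Return one source per combination of placeholder values."""
    entry = dict(entry)  # do not mutate the caller's entry
    placeholders = entry.pop("placeholders", {})
    if not placeholders:
        return {name: entry}
    keys = list(placeholders)
    sources = {}
    for combo in product(*(placeholders[k] for k in keys)):
        new_name, new_path = name, entry["path"]
        for key, value in zip(keys, combo):
            token = "{" + key + "}"
            # Plain replace, so other {keys} in the path are left alone.
            new_name = new_name.replace(token, str(value))
            new_path = new_path.replace(token, str(value))
        sources[new_name] = {**entry, "path": new_path}
    return sources


entry = {
    "driver": "netcdf",
    "path": "root/{variable}/reanalysis_{variable}_{freq}_{year}_{month:02d}_v1.nc",
    "placeholders": {"freq": ["10min", "hourly", "dailymax"]},
}
for source_name in expand_placeholders("gtsm_codec_reanalysis_{freq}_v1", entry):
    print(source_name)
```

With two or more placeholder keys the same function yields the full product, which is how 23 models times 2 timesteps would give 46 sources.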

The rest are "known" keywords in the path that hydromt can use to slice data directly when reading a data source. For example, in some get_data methods you can pass time_tuple (which then uses the year and month keywords if present) or a variables list (which then uses the variable keyword if present). In the case of your example or ERA5, if you request only precipitation for one year in get_data, this allows hydromt to read a single file, precip_2001.nc, instead of reading all netcdf files for all years and all variables before slicing (so it is faster and potentially uses less memory).

But then, like zoom_level, not all of these keywords may be applicable to all types of DataSource. I'm not sure offhand which applies to which, but basically you can check the driver's arguments and see whether you can pass it time_tuple, variables and/or zoom_level.

Maybe one final example to try and understand the difference between placeholder and known keywords:

# Placeholders have to be replaced in the data source name to get the data, and keywords can be passed as arguments to the get_data methods
data_catalog.get_geodataset("gtsm_codec_reanalysis_hourly_v1", variables=["precip"], time_tuple=("2010-01-01", "2010-03-31"))
# Get the 10min version of the dataset instead, for all times and variables
data_catalog.get_geodataset("gtsm_codec_reanalysis_10min_v1")
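
To illustrate why the known keywords matter, here is a small sketch of how a `time_tuple` and a `variables` list could be turned into the exact set of files to read. The helper names and the shortened path are hypothetical; this is not how HydroMT implements it.

```python
from datetime import date
from typing import List, Tuple


def months_between(start: str, end: str) -> List[Tuple[int, int]]:
    """List (year, month) pairs from start to end, both inclusive."""
    first, last = date.fromisoformat(start), date.fromisoformat(end)
    year, month = first.year, first.month
    pairs = []
    while (year, month) <= (last.year, last.month):
        pairs.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return pairs


def candidate_paths(path: str, variables: List[str], time_tuple: Tuple[str, str]) -> List[str]:
    """Fill the variable, year and month keywords for the requested slice only."""
    return [
        path.format(variable=variable, year=year, month=month)
        for variable in variables
        for year, month in months_between(*time_tuple)
    ]


path = "{variable}/reanalysis_{variable}_hourly_{year}_{month:02d}_v1.nc"
for p in candidate_paths(path, ["precip"], ("2010-01-01", "2010-03-31")):
    print(p)
```

For the request above only three monthly files are candidates, rather than every file for every year and variable.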

DirkEilander · 2024-01-17T08:41:55Z

In addition to @hboisgon: the placeholders are resolved when parsing the data catalog. The path format arguments are checked in the resolve path step and should be part of the new Driver class, as some drivers will need a driver-specific resolve path method (e.g. tiled datasets without a VRT, such as the Copernicus DEM on S3 example).

We can discuss whether the placeholder architecture can be replaced by an extended implementation of the variants; this might be clearer to users. It would result in slightly longer data catalog files but more flexibility (e.g. some driver kwargs can be specific to one variant). Currently the variants only support version and provider, and I'm not sure how easy it is to generalize this. @savente93 @hboisgon Is this worth exploring? Anyway, this is another topic.

hboisgon · 2024-01-17T08:55:24Z

I was wondering the same: could we replace placeholders with variants? Maybe worth exploring in a new issue (for v1)? If we do it well, data catalogs would be longer, but it might be easier for the user to understand. So worth exploring, I think.

Jaapel · 2024-01-17T10:09:25Z

So far I intend to place a generic solution with hydromt keywords year, month, variable, zoom_level, at the DataSource level. Drivers can then fill in any {{key}} themselves, as they will get 1 or more URIs. Is the fact that we add an extra { to the key because of some windows (driver specific) reason, or does it have another reason?

savente93 · 2024-01-17T11:17:59Z

I think there is definitely something to this idea, but it would be good to have a (short) design session around it. One thing we definitely want is a distinction between the kinds of placeholders, since they need to be handled at different times, if I understand correctly. I'm not sure what the correct terminology is, but for now I'll call them data-slice placeholders (var=precip) and file path placeholders (year/month/freq). One thing I personally find annoying about the current placeholder implementation is that it doesn't communicate possible values, such as year or variable in the first example. Additionally, especially when dealing with cloud file systems, any processing we can do up front without having to ask the fs for information is going to speed up the process, so if possible I'm in favour of that. So I'm definitely in favour of looking further into using the variants.

DirkEilander · 2024-01-17T11:22:03Z

> So far I intend to place a generic solution with hydromt keywords year, month, variable, zoom_level, at the DataSource level. Drivers can then fill in any {{key}} themselves, as they will get 1 or more URIs.

My thinking behind implementing the generic resolve path solution at the DataDriver level is that it can then easily be extended or skipped by custom drivers. For instance, the zoom_level key is only relevant for some RasterDataDrivers, and the file check with fsspec also won't work for many custom drivers that target specific APIs like gww. Just putting this here to keep these use cases in mind; it could well be that these are covered in your approach too.

> Is the fact that we add an extra { to the key because of some windows (driver specific) reason, or does it have another reason?

The double { is only used to escape unknown keys. For instance, if your path looks like C:/{long-vm-ware-uuid}/merit/{variable}.tif, in _resolve_path we first convert it to C:/{{long-vm-ware-uuid}}/merit/{variable}.tif so we can then format this string with variable="my_variable" without getting an error that "long-vm-ware-uuid" is unknown or similar.
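
The escaping step can be sketched with the standard library alone. This is an illustration of the idea, not the actual `_resolve_path` code; the function name `escape_unknown_keys` is made up, and the set of known keys is taken from the keywords named in this thread.

```python
import string

KNOWN_KEYS = {"year", "month", "zoom_level", "variable"}


def escape_unknown_keys(path: str) -> str:
    """Double the braces around every key that is not a known path keyword."""
    parts = []
    for literal, key, spec, conversion in string.Formatter().parse(path):
        # parse() turns '{{' into '{' in literal text, so escape it again.
        parts.append(literal.replace("{", "{{").replace("}", "}}"))
        if key is None:
            continue
        field = key
        if conversion:
            field += "!" + conversion
        if spec:
            field += ":" + spec
        if key in KNOWN_KEYS:
            parts.append("{" + field + "}")
        else:
            parts.append("{{" + field + "}}")
    return "".join(parts)


raw = "C:/{long-vm-ware-uuid}/merit/{variable}.tif"
escaped = escape_unknown_keys(raw)
print(escaped)
print(escaped.format(variable="my_variable"))
```

Calling `raw.format(variable="my_variable")` directly would raise a `KeyError` for the unknown key, which is the error the escaping avoids.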

DirkEilander · 2024-01-17T11:34:09Z

> One thing I think is definitely something we want is to make a distinction between the kinds of place holders since they need to be handled at different times. I'm not sure what the correct terminology is, but for now I'll call them data-slice place holders (var=precip) and file path place holders (year/month/freq).

Just to clarify the discussion: we have HydroMT path keywords. These are resolved at runtime, based on the request, to read only a slice of the data (currently in DataAdapter, but this will be moved to DataSource/Driver). Currently these keywords are ["year", "month", "zoom_level", "variable"].

Next to this we have placeholders and variants, which are concepts to define multiple variants of the same source more easily in the data catalog. These are resolved when reading the data catalog and result in unique source items. Placeholders can be anything defined by the user. Variants can only be specified based on version and provider.

The concepts of placeholders and variants could perhaps be merged (to be discussed) and might help to solve the confusion between placeholders and path keywords.

Jaapel · 2024-03-01T11:56:34Z

Proposed for coming refinement

Jaapel · 2024-03-06T14:44:59Z

New version based on discussions

savente93 · 2024-05-28T12:21:50Z

I think this is resolved with the current driver implementation in v1

savente93 added the labels "Enhancement" (New feature or request) and "Needs refinement" (issue still needs refinement) Jan 10, 2024

savente93 modified the milestones: 2024 - Q1, v1.0 Jan 10, 2024

savente93 added V1 and removed V1 labels Jan 11, 2024

savente93 modified the milestones: v1.0, 2024 - Q1 Jan 11, 2024

Jaapel mentioned this issue Jan 24, 2024 in "Add driver for GeoDataFrame" #750 (merged)

DirkEilander mentioned this issue Feb 7, 2024 in "Update data catalog" #718 (merged)

savente93 linked a pull request Apr 8, 2024 that will close this issue: "Add driver for GeoDataFrame" #750 (merged)

savente93 closed this as completed May 28, 2024

savente93 modified the milestones: v1.0 beta, v1.0 alpha May 28, 2024
