Add xCDAT tutorial datasets and update gallery notebooks #705

Merged: 44 commits merged into main on Mar 20, 2025

Conversation

@tomvothecoder (Collaborator) commented Oct 3, 2024

Description

This PR updates the Jupyter Notebooks to use datasets from the new repository, xCDAT/xcdat-data. This repository contains the same datasets previously sourced from ESGF but with reduced file sizes by subsetting on time or lat/lon. Most plots should remain the same or similar to before.

Related Issues

Changes Implemented

  • Updated Jupyter Notebooks to replace ESGF OPeNDAP datasets with data from xCDAT/xcdat-data.
  • Added the xcdat.tutorial module with the xcdat.tutorial.open_dataset() function, modeled after xarray.tutorial.open_dataset() (a usage sketch follows this list).
    • Included xcdat.tutorial.open_dataset() in the API reference documentation.
  • Added pooch as an optional dependency, updating:
    • conda-env/dev.yml
    • pyproject.toml
    • Installation documentation
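
As a quick illustration, here is a minimal usage sketch of the new function (assuming pooch is installed and network access is available; the dataset key "tas_amon_access" comes from the tutorial module shown later in this PR):

from xcdat import tutorial

# Download (and cache) the subsetted ACCESS-ESM1-5 monthly near-surface air
# temperature dataset from xCDAT/xcdat-data; X/Y bounds are added by default.
ds = tutorial.open_dataset("tas_amon_access")

# The result is a regular xarray.Dataset, so existing xCDAT operations apply,
# e.g. a global spatial average of tas.
ds_avg = ds.spatial.average("tas")
print(ds_avg["tas"])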

Notebooks Checklist

  • climatology-and-departures.ipynb
  • general-utilities.ipynb
  • introduction-to-xcdat.ipynb
  • parallel-computing-with-dask.ipynb
  • regridding-horizontal.ipynb
  • regridding-vertical.ipynb
  • spatial-average.ipynb
  • temporal-average.ipynb

Review

Please go through each notebook and compare it side-by-side with the previous version.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@tomvothecoder changed the title from "Replace OPeNDAP datasets with Xarray tutorial datasets" to "Replace OPeNDAP datasets with Xarray tutorial datasets in docs" on Oct 3, 2024
@tomvothecoder self-assigned this on Oct 3, 2024
@github-actions bot added the "type: docs" (Updates to documentation) label on Oct 3, 2024

codecov bot commented Oct 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (c52b5a7) to head (b8b200a).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #705   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           15        16    +1     
  Lines         1621      1658   +37     
=========================================
+ Hits          1621      1658   +37     


@tomvothecoder (Collaborator, Author) commented:

For some of these examples, we probably need to host some ESGF datasets in an xcdat-data repo, similar to https://github.com/pydata/xarray-data. The datasets at xarray-data are subsetted on lat/lon, which means I can't plot a global color map. The plots look odd, and generating dummy datasets in memory is not that simple (e.g., producing realistic tas data in a NumPy array).

An added benefit of this approach is that we can use real-world datasets, and it can help standardize our approach to testing.

@tomvothecoder (Collaborator, Author) commented Mar 13, 2025

My proposed solution

  • 1. Get the list of datasets used in the notebooks -- figure out which ones overlap between notebooks.
# Gentle Introduction
* "https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"

# xCDAT utilities
* "https://esgf-data2.llnl.gov/thredds/dodsC/user_pub_work/E3SM/1_0/amip_1850_aeroF/1deg_atm_60-30km_ocean/atmos/180x360/time-series/mon/ens2/v3/TS_187001_189412.nc"
* "https://esgf-data2.llnl.gov/thredds/dodsC/user_pub_work/E3SM/1_0/amip_1850_aeroF/1deg_atm_60-30km_ocean/atmos/180x360/time-series/mon/ens2/v3/TS_189501_191912.nc",

# Spatial Averaging
* "https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"
* "https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/pr/gn/v20200605/pr_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"

# Temporal Averaging
* "https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"
* "https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/3hr/tas/gn/v20200605/tas_3hr_ACCESS-ESM1-5_historical_r10i1p1f1_gn_201001010300-201501010000.nc"

# Climatologies and departures
* "http://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"
# This dataset should not be downloaded. We can subset 
* "http://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/3hr/tas/gn/v20200605/tas_3hr_ACCESS-ESM1-5_historical_r10i1p1f1_gn_201001010300-201501010000.nc"

# Horizontal regridding
* "http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/CCCma/CanESM5/historical/r13i1p1f1/Amon/tas/gn/v20190429/tas_Amon_CanESM5_historical_r13i1p1f1_gn_185001-201412.nc"
* "http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/abrupt-4xCO2/r1i1p1f1/day/tas/gr2/v20180701/tas_day_GFDL-CM4_abrupt-4xCO2_r1i1p1f1_gr2_00010101-00201231.nc"

# Vertical regridding
* "http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Omon/so/gn/v20190308/so_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc",
* "http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Omon/thetao/gn/v20190308/thetao_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc",
* "http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/abrupt-4xCO2/r1i1p1f1/day/tas/gr2/v20180701/tas_day_GFDL-CM4_abrupt-4xCO2_r1i1p1f1_gr2_00010101-00201231.nc"
  • 2. Host the following datasets on xcdat-data -- subsetted on time to keep each file under 100 MB (maybe 3-5 years?); a subsetting sketch follows this list
  • 3. Update xc.tutorial.open_dataset() with paths to these files
  • 4. Update Jupyter Notebook examples. -- IN PROGRESS
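
For step 2, the subsetting itself can be done with plain xarray; the sketch below is illustrative only (the exact year range, compression settings, and output filename conventions used in xcdat-data are assumptions):

import xarray as xr

# Open one of the full ESGF files (downloaded locally) and keep a short time
# slice so the hosted copy stays well under 100 MB.
ds = xr.open_dataset(
    "tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"
)
ds_subset = ds.sel(time=slice("1850-01", "1854-12"))  # ~5 years (assumed range)

# Write the subset with compression to shrink the file further.
ds_subset.to_netcdf(
    "tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412_subset.nc",
    encoding={"tas": {"zlib": True, "complevel": 4}},
)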

@tomvothecoder marked this pull request as ready for review on March 17, 2025, 22:47
@tomvothecoder (Collaborator, Author) left a comment:

Hey @xCDAT/core-developers, I finally finished this PR. This PR updates the Jupyter Notebooks to use datasets from the new repository, xCDAT/xcdat-data. It contains the same datasets previously sourced from ESGF but with reduced file sizes by subsetting on time or lat/lon. Most plots should remain the same or similar to before.

My self-review checks out and I plan on merging by the end of the week. If anybody has time in the next few days, a quick review of the diffs would be great. Otherwise I'll proceed with merging to move on.

Comment on lines +1 to +124
XARRAY_DATASETS = list(file_formats.keys()) + ["era5-2mt-2019-03-uk.grib"]
XCDAT_DATASETS: Dict[str, str] = {
    # Monthly precipitation data from the ACCESS-ESM1-5 model.
    "pr_amon_access": "pr_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412_subset.nc",
    # Monthly ocean salinity data from the CESM2 model.
    "so_omon_cesm2": "so_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412_subset.nc",
    # Monthly near-surface air temperature from the ACCESS-ESM1-5 model.
    "tas_amon_access": "tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412_subset.nc",
    # 3-hourly near-surface air temperature from the ACCESS-ESM1-5 model.
    "tas_3hr_access": "tas_3hr_ACCESS-ESM1-5_historical_r10i1p1f1_gn_201001010300-201501010000_subset.nc",
    # Monthly near-surface air temperature from the CanESM5 model.
    "tas_amon_canesm5": "tas_Amon_CanESM5_historical_r13i1p1f1_gn_185001-201412_subset.nc",
    # Monthly ocean potential temperature from the CESM2 model.
    "thetao_omon_cesm2": "thetao_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412_subset.nc",
    # Monthly cloud fraction data from the E3SM-2-0 model.
    "cl_amon_e3sm2": "cl_Amon_E3SM-2-0_historical_r1i1p1f1_gr_185001-189912_subset.nc",
    # Monthly air temperature data from the E3SM-2-0 model.
    "ta_amon_e3sm2": "ta_Amon_E3SM-2-0_historical_r1i1p1f1_gr_185001-189912_subset.nc",
}


def open_dataset(
    name: str,
    cache: bool = True,
    cache_dir: None | str | os.PathLike = DEFAULT_CACHE_DIR_NAME,
    add_bounds: List[CFAxisKey] | Tuple[CFAxisKey, ...] | None = ("X", "Y"),
    **kargs,
) -> xr.Dataset:
    """
    Open a dataset from the online repository (requires internet).

    This function is mostly based on ``xarray.tutorial.open_dataset()`` with
    some modifications, including adding missing bounds to the dataset.

    If a local copy is found then always use that to avoid network traffic.

    Available xCDAT datasets:

    * ``"pr_amon_access"``: Monthly precipitation data from the ACCESS-ESM1-5 model.
    * ``"so_omon_cesm2"``: Monthly ocean salinity data from the CESM2 model.
    * ``"tas_amon_access"``: Monthly near-surface air temperature from the ACCESS-ESM1-5 model.
    * ``"tas_3hr_access"``: 3-hourly near-surface air temperature from the ACCESS-ESM1-5 model.
    * ``"tas_amon_canesm5"``: Monthly near-surface air temperature from the CanESM5 model.
    * ``"thetao_omon_cesm2"``: Monthly ocean potential temperature from the CESM2 model.
    * ``"cl_amon_e3sm2"``: Monthly cloud fraction data from the E3SM-2-0 model.
    * ``"ta_amon_e3sm2"``: Monthly air temperature data from the E3SM-2-0 model.

    Parameters
    ----------
    name : str
        Name of the file containing the dataset.
        e.g. 'tas_amon_access'
    cache_dir : path-like, optional
        The directory in which to search for and write cached data.
    cache : bool, optional
        If True, then cache data locally for use on subsequent calls.
    add_bounds : List[CFAxisKey] | Tuple[CFAxisKey] | None, optional
        List or tuple of axis keys for which to add bounds, by default
        ("X", "Y").
    **kargs : dict, optional
        Passed to ``xcdat.open_dataset``.
    """
    try:
        import pooch
    except ImportError as e:
        raise ImportError(
            "tutorial.open_dataset depends on pooch to download and manage datasets."
            " To proceed please install pooch."
        ) from e

    # Avoid circular import in __init__.py
    from xcdat.dataset import open_dataset

    logger = pooch.get_logger()
    logger.setLevel("WARNING")

    cache_dir = _construct_cache_dir(cache_dir)

    filename = XCDAT_DATASETS.get(name)
    if filename is None:
        raise ValueError(
            f"Dataset {name} not found. Available xcdat datasets are: {XCDAT_DATASETS.keys()}"
        )

    path = pathlib.Path(filename)
    url = f"{base_url}/raw/{version}/{path.name}"

    headers = {"User-Agent": f"xcdat {sys.modules['xcdat'].__version__}"}
    downloader = pooch.HTTPDownloader(headers=headers)

    filepath = pooch.retrieve(
        url=url, known_hash=None, path=cache_dir, downloader=downloader
    )
    ds = open_dataset(filepath, **kargs, add_bounds=add_bounds)

    if not cache:
        ds = ds.load()
        pathlib.Path(filepath).unlink()

    return ds
@tomvothecoder (Collaborator, Author) commented on the diff:

The new tutorial.py module with xcdat.tutorial.open_dataset().
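
For reference, a hedged sketch of how the notebooks can exercise the optional parameters (the dataset keys come from XCDAT_DATASETS above; the specific keyword arguments shown are assumptions about what callers might pass through to xcdat.open_dataset()):

from xcdat import tutorial

# Fetch the subsetted CESM2 ocean salinity file, cache it locally, and add
# bounds for the X, Y, and Z axes instead of the default ("X", "Y").
ds = tutorial.open_dataset("so_omon_cesm2", add_bounds=("X", "Y", "Z"))

# Remaining keyword arguments are forwarded to xcdat.open_dataset(), e.g.
# chunking with Dask for the parallel-computing notebook.
ds_chunked = tutorial.open_dataset("so_omon_cesm2", chunks={"time": 12})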

@lee1043 (Collaborator) commented Mar 18, 2025

@tomvothecoder In my very quick glimpse I don't see any obviously noticeable issues! Notebooks are looking good to me. It's great to leverage xarray's sample datasets so we don't have to maintain our own. Thank you for your work for this PR!

@tomvothecoder (Collaborator, Author) replied:

Thanks for the review @lee1043! I actually decided to create xCDAT sample datasets (https://github.com/xCDAT/xcdat-data), which contain the same ESGF datasets but subsetted. This allows us to keep the same examples in the notebooks. I found that using the xarray sample datasets resulted in more significant changes to the notebooks.

@lee1043 (Collaborator) commented Mar 19, 2025

@tomvothecoder if maintaining our own sample datasets is not a huge effort, I am not opposed to that. Thanks a lot!

@tomvothecoder changed the title from "Replace OPeNDAP datasets with Xarray tutorial datasets in docs" to "Add xCDAT tutorial datasets and update gallery notebooks" on Mar 20, 2025
@tomvothecoder merged commit a282117 into main on Mar 20, 2025 (10 checks passed)
Labels: type: docs (Updates to documentation)
Projects: Status: Done
Participants: 2