From a9774b820b89c5bfb42c673b9b0d65312882f7cf Mon Sep 17 00:00:00 2001 From: Matthew Iannucci Date: Mon, 14 Oct 2024 20:54:55 -0400 Subject: [PATCH 01/12] Initial virtual commit --- docs/docs/icechunk-python/virtual.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/docs/docs/icechunk-python/virtual.md b/docs/docs/icechunk-python/virtual.md index 28c96015..36efad94 100644 --- a/docs/docs/icechunk-python/virtual.md +++ b/docs/docs/icechunk-python/virtual.md @@ -1,3 +1,9 @@ # Virtual Datasets -Kerchunk, VirtualiZarr, etc. \ No newline at end of file +While Icechunk works wonderful with native chunks managed by zarr, there are many times where creating a dataset relies on existing archived data. To allow this, Icechunk supports "Virtual" chunks, where any number of chunks in a given dataset may reference external data in existing archival formats, such as netCDF, HDF, GRIB, or TIFF. + +!!! warning + + While virtual references are fully supported in Icechunk, creating virtual datasets relies on using experimental or pre-release versions of open source tools. For a full breakdown on how to get started and the current status of the tools [see the tracking issue on Github](https://github.com/earth-mover/icechunk/issues/197). + +To create virtual Icechunk datasets with python, we utilize the [kerchunk](https://fsspec.github.io/kerchunk/) and [virtualizarr](https://virtualizarr.readthedocs.io/en/latest/) packages. `kerchunk` allows us to extract virtual references from existing data files, and `virtualizarr` allows us to use `xarray` to combine these extracted virtual references into full blown datasets. 
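As an aside to make the "Virtual" chunk idea in the patch above concrete: a virtual chunk reference boils down to "which file, at what byte offset, for how many bytes". A minimal illustrative sketch — the class and field names here are hypothetical, not Icechunk's actual API:

```python
from dataclasses import dataclass


# Illustrative only: these names are invented for this sketch and are
# not Icechunk's real data model or API.
@dataclass(frozen=True)
class VirtualChunkRef:
    url: str      # location of the archival file (e.g. a netCDF file on S3)
    offset: int   # byte offset of the chunk within that file
    length: int   # number of bytes to read


# A dataset's chunk grid can then be described by a manifest that maps
# chunk keys to such references, mixing virtual and native chunks:
manifest = {
    "sst/0.0.0": VirtualChunkRef("s3://some-bucket/sst-day1.nc", 8192, 131072),
    "sst/1.0.0": VirtualChunkRef("s3://some-bucket/sst-day2.nc", 8192, 131072),
}

print(manifest["sst/0.0.0"].url)
```

Resolving such a reference is then just a ranged read against the original file (for instance an HTTP `Range` request to S3), which is why the archival data never needs to be copied or rewritten.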
From 595b8b4e52cd3490e4cb2240c86756ecd094f485 Mon Sep 17 00:00:00 2001 From: Matthew Iannucci Date: Mon, 14 Oct 2024 21:35:01 -0400 Subject: [PATCH 02/12] Add virtual dataset tutorial --- docs/docs/icechunk-python/virtual.md | 140 ++++++++++++++++++++++++++- 1 file changed, 138 insertions(+), 2 deletions(-) diff --git a/docs/docs/icechunk-python/virtual.md b/docs/docs/icechunk-python/virtual.md index 28c96015..36efad94 100644 --- a/docs/docs/icechunk-python/virtual.md +++ b/docs/docs/icechunk-python/virtual.md @@ -4,6 +4,142 @@ While Icechunk works wonderful with native chunks managed by zarr, there are man !!! warning - While virtual references are fully supported in Icechunk, creating virtual datasets relies on using experimental or pre-release versions of open source tools. For a full breakdown on how to get started and the current status of the tools [see the tracking issue on Github](https://github.com/earth-mover/icechunk/issues/197). + While virtual references are fully supported in Icechunk, creating virtual datasets relies on using experimental or pre-release versions of open source tools. For full instructions on how to install the required tools and their current statuses [see the tracking issue on Github](https://github.com/earth-mover/icechunk/issues/197). -To create virtual Icechunk datasets with python, we utilize the [kerchunk](https://fsspec.github.io/kerchunk/) and [virtualizarr](https://virtualizarr.readthedocs.io/en/latest/) packages. `kerchunk` allows us to extract virtual references from existing data files, and `virtualizarr` allows us to use `xarray` to combine these extracted virtual references into full blown datasets. +To create virtual Icechunk datasets with python, we utilize the [kerchunk](https://fsspec.github.io/kerchunk/) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/) packages. 
`kerchunk` allows us to extract virtual references from existing data files, and `VirtualiZarr` allows us to use `xarray` to combine these extracted virtual references into full blown datasets. + +## Creating a virtual dataset + +We are going to create a virtual dataset with all of the [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) data for August 2024. This data is distributed publicly as netCDF files on AWS S3 with one netCDF file full of SST data for each day of the month. We are going to use `VirtualiZarr` to combine all of these files into a single virtual dataset spanning the entire month, then write that dataset to Icechunk for use in analysis. + +!!! note + + At this point you should have followed the instructions [here](https://github.com/earth-mover/icechunk/issues/197) to install the necessary tools. + +Before we get started, we also need to install `fsspec` and `s3fs` for working with data on s3. + +```shell +pip install fsspec s3fs +``` + +First, we need to find all of the files we are interested in; we will do this with fsspec using a `glob` expression to find every netCDF file in the August 2024 folder in the bucket: + +```python +import fsspec + +fs = fsspec.filesystem('s3') + +oisst_files = fs.glob('s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.*.nc') + +oisst_files = sorted(['s3://'+f for f in oisst_files]) +#['s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.20240801.nc', +# 's3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.20240802.nc', +# 's3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.20240803.nc', +# 's3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.20240804.nc', +#... 
+#] +``` + +Now that we have the filenames of the data we need, we can create virtual datasets with `VirtualiZarr`. This may take a minute. + +```python +from virtualizarr import open_virtual_dataset + +virtual_datasets = [ + open_virtual_dataset(url, indexes={}) + for url in oisst_files +] +``` + +We can now use `xarray` to combine these virtual datasets into one large virtual dataset. We know that all of our files share the same structure but with a different date. So we are going to concatenate these datasets on the `time` dimension. + +```python +import xarray as xr + +virtual_ds = xr.combine_nested( + virtual_datasets, + concat_dim=['time'], + coords='minimal', + compat='override', + combine_attrs='override' +) + +# Size: 257MB +#Dimensions: (time: 31, zlev: 1, lat: 720, lon: 1440) +#Coordinates: +# time (time) float32 124B ManifestArray +#... + +# Size: 1GB +#Dimensions: (lon: 1440, time: 31, zlev: 1, lat: 720) +#Coordinates: +# * lon (lon) float32 6kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9 +# * zlev (zlev) float32 4B 0.0 +# * time (time) datetime64[ns] 248B 2024-08-01T12:00:00 ... 2024-08-31T12... +# * lat (lat) float32 3kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88 +#Data variables: +# sst (time, zlev, lat, lon) float64 257MB dask.array +# ice (time, zlev, lat, lon) float64 257MB dask.array +# anom (time, zlev, lat, lon) float64 257MB dask.array +# err (time, zlev, lat, lon) float64 257MB dask.array +``` + +Success! We have created our full dataset with 31 timesteps, ready for analysis! 
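Conceptually, the `time` concatenation in the patch above amounts to merging the per-file chunk manifests and re-keying every chunk by its position along the new axis. A rough sketch of that idea — this is not VirtualiZarr's actual internals, and the keys and byte ranges are invented for illustration:

```python
# One manifest per source file; each maps a chunk key to an invented
# (url, offset, length) reference. Every daily file contributes one
# chunk at time index 0 of its own dataset.
per_file_manifests = [
    {"sst/0.0.0": ("s3://bucket/day1.nc", 8192, 131072)},
    {"sst/0.0.0": ("s3://bucket/day2.nc", 8192, 131072)},
]

combined = {}
for t, manifest in enumerate(per_file_manifests):
    for key, ref in manifest.items():
        var, chunk_key = key.split("/", 1)           # "sst", "0.0.0"
        spatial = chunk_key.split(".", 1)[1]         # "0.0" (zlev/lat/lon part)
        # the leading index becomes the chunk's position on the concat axis
        combined[f"{var}/{t}.{spatial}"] = ref

print(sorted(combined))  # ['sst/0.0.0', 'sst/1.0.0']
```

No bytes move during this step: only the small reference mappings are rewritten, which is why combining a month of files is cheap.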
From 88547156e61abe2d952848d0768363de68c06329 Mon Sep 17 00:00:00 2001 From: Matthew Iannucci Date: Mon, 14 Oct 2024 22:31:42 -0400 Subject: [PATCH 03/12] Update docs/docs/icechunk-python/virtual.md Co-authored-by: Tom Nicholas --- docs/docs/icechunk-python/virtual.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/docs/icechunk-python/virtual.md b/docs/docs/icechunk-python/virtual.md index 2fc465ad..e5e4510a 100644 --- a/docs/docs/icechunk-python/virtual.md +++ b/docs/docs/icechunk-python/virtual.md @@ -125,7 +125,8 @@ Now we can read the dataset from the store using xarray to confirm everything we ds = xr.open_zarr( store, zarr_version=3, - consolidated=False, chunks={} + consolidated=False, + chunks={}, ) # Size: 1GB From 41191a4556555fc45ba0b0c442702a7a14c653dd Mon Sep 17 00:00:00 2001 From: Matthew Iannucci Date: Mon, 14 Oct 2024 22:32:03 -0400 Subject: [PATCH 04/12] Update docs/docs/icechunk-python/virtual.md Co-authored-by: Tom Nicholas --- docs/docs/icechunk-python/virtual.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/docs/icechunk-python/virtual.md b/docs/docs/icechunk-python/virtual.md index e5e4510a..80b5c75d 100644 --- a/docs/docs/icechunk-python/virtual.md +++ b/docs/docs/icechunk-python/virtual.md @@ -82,7 +82,7 @@ We have a virtual dataset with 31 timestamps! Let's create an Icechunk store to !!! note - Take note of the `virtual_ref_config` passed into the `StoreConfig` when creating the store. This allows the icechunk store to have the necessary credentials to access the netCDF data on s3. For more configuration options, see the [configuration page](./configuration.md). + Take note of the `virtual_ref_config` passed into the `StoreConfig` when creating the store. This allows the icechunk store to have the necessary credentials to access the referenced netCDF data on s3 at read time. For more configuration options, see the [configuration page](./configuration.md). 
```python from icechunk import IcechunkStore, StorageConfig, StoreConfig, VirtualRefConfig From 89ea0daa74cae490a85810da42b33d362e50b17a Mon Sep 17 00:00:00 2001 From: Matthew Iannucci Date: Mon, 14 Oct 2024 22:32:26 -0400 Subject: [PATCH 05/12] Update docs/docs/icechunk-python/virtual.md Co-authored-by: Tom Nicholas --- docs/docs/icechunk-python/virtual.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/docs/icechunk-python/virtual.md b/docs/docs/icechunk-python/virtual.md index 80b5c75d..134c5ae8 100644 --- a/docs/docs/icechunk-python/virtual.md +++ b/docs/docs/icechunk-python/virtual.md @@ -1,6 +1,6 @@ # Virtual Datasets -While Icechunk works wonderful with native chunks managed by zarr, there are many times where creating a dataset relies on existing archived data. To allow this, Icechunk supports "Virtual" chunks, where any number of chunks in a given dataset may reference external data in existing archival formats, such as netCDF, HDF, GRIB, or TIFF. +While Icechunk works wonderfully with native chunks managed by zarr, there are many times where creating a dataset relies on existing archived data. To allow this, Icechunk supports "Virtual" chunks, where any number of chunks in a given dataset may reference external data in existing archival formats, such as netCDF, HDF, GRIB, or TIFF. !!! warning From 2b262b06fa8c3d36b7b6d5687cab2da0ab3714bd Mon Sep 17 00:00:00 2001 From: Matthew Iannucci Date: Mon, 14 Oct 2024 22:32:34 -0400 Subject: [PATCH 06/12] Update docs/docs/icechunk-python/virtual.md Co-authored-by: Tom Nicholas --- docs/docs/icechunk-python/virtual.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/docs/icechunk-python/virtual.md b/docs/docs/icechunk-python/virtual.md index 134c5ae8..842f1b38 100644 --- a/docs/docs/icechunk-python/virtual.md +++ b/docs/docs/icechunk-python/virtual.md @@ -111,7 +111,7 @@ The refs are written so let's save our progress by committing to the store. !!! 
note - The commit hash will be different! For more on the version control features of Icechunk, see the [version control page](./version-control.md). + Your commit hash will be different! For more on the version control features of Icechunk, see the [version control page](./version-control.md). ```python store.commit() From fa0bd56a2c47d5e9bc3d0d8603a8aa4d1868ae31 Mon Sep 17 00:00:00 2001 From: Matthew Iannucci Date: Tue, 15 Oct 2024 09:24:51 -0400 Subject: [PATCH 07/12] Refine documentation --- docs/docs/icechunk-python/virtual.md | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/docs/docs/icechunk-python/virtual.md b/docs/docs/icechunk-python/virtual.md index 842f1b38..a3a1a94c 100644 --- a/docs/docs/icechunk-python/virtual.md +++ b/docs/docs/icechunk-python/virtual.md @@ -6,11 +6,15 @@ While Icechunk works wonderfully with native chunks managed by zarr, there are m !!! warning - While virtual references are fully supported in Icechunk, creating virtual datasets relies on using experimental or pre-release versions of open source tools. For full instructions on how to install the required tools and their current statuses [see the tracking issue on Github](https://github.com/earth-mover/icechunk/issues/197). -To create virtual Icechunk datasets with python, we utilize the [kerchunk](https://fsspec.github.io/kerchunk/) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/) packages. +To create virtual Icechunk datasets with python, the community utilizes the [kerchunk](https://fsspec.github.io/kerchunk/) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/) packages. -## Creating a virtual dataset +`kerchunk` allows scanning the metadata of existing data files to extract virtual references. 
It also provides methods to combine these references into [larger virtual datasets](https://fsspec.github.io/kerchunk/tutorial.html#combine-multiple-kerchunked-datasets-into-a-single-logical-aggregate-dataset), which can be exported to its [reference format](https://fsspec.github.io/kerchunk/spec.html). + +`VirtualiZarr` lets users ingest existing data files into virtual datasets using various different tools under the hood, including `kerchunk`, `xarray`, `zarr`, and now `icechunk`. It does so by creating virtual references to existing data that can be combined and manipulated to create larger virtual datasets using `xarray`. These datasets can then be exported to `kerchunk` reference format or to an `Icechunk` store, without ever copying or moving the existing data files. + +## Creating a virtual dataset with VirtualiZarr + +We are going to create a virtual dataset with all of the [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) data for August 2024. This data is distributed publicly as netCDF files on AWS S3, with one netCDF file containing the Sea Surface Temperature (SST) data for each day of the month. We are going to use `VirtualiZarr` to combine all of these files into a single virtual dataset spanning the entire month, then write that dataset to Icechunk for use in analysis. !!! note @@ -51,14 +55,14 @@ virtual_datasets = [ ] ``` -We can now use `xarray` to combine these virtual datasets into one large virtual dataset. 
We know that all of our files share the same structure but with a different date. So we are going to concatenate these datasets on the `time` dimension. +We can now use `xarray` to combine these virtual datasets into one large virtual dataset (for more details on this operation, see [`VirtualiZarr`'s documentation](https://virtualizarr.readthedocs.io/en/latest/usage.html#combining-virtual-datasets)). We know that all of our files share the same structure but with a different date. So we are going to concatenate these datasets on the `time` dimension. ```python import xarray as xr -virtual_ds = xr.combine_nested( +virtual_ds = xr.concat( virtual_datasets, - concat_dim=['time'], + dim='time', coords='minimal', compat='override', combine_attrs='override' ) @@ -78,7 +82,7 @@ virtual_ds = xr.combine_nested( # err (time, zlev, lat, lon) int16 64MB ManifestArray ``` -Success! We have created our full dataset with 31 timesteps, ready for analysis! +Success! We have created our full dataset with 31 timesteps spanning the month of August, all with virtual references to pre-existing data files in object storage. This means we can now version control our dataset, allowing us to update it and roll it back to a previous version without copying or moving any data from the original files. \ No newline at end of file From d2ab8cb5c2d6d7dfb6edd7833f4a26c77df5215a Mon Sep 17 00:00:00 2001 From: Matthew Iannucci Date: Tue, 15 Oct 2024 09:34:55 -0400 Subject: [PATCH 08/12] More details --- docs/docs/icechunk-python/virtual.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/docs/icechunk-python/virtual.md b/docs/docs/icechunk-python/virtual.md index a3a1a94c..c49df60f 100644 --- a/docs/docs/icechunk-python/virtual.md +++ b/docs/docs/icechunk-python/virtual.md @@ -1,6 +1,6 @@ # Virtual Datasets -While Icechunk works wonderfully with native chunks managed by zarr, there are many times where creating a dataset relies on existing archived data. 
To allow this, Icechunk supports "Virtual" chunks, where any number of chunks in a given dataset may reference external data in existing archival formats, such as netCDF, HDF, GRIB, or TIFF. +While Icechunk works wonderfully with native chunks managed by zarr, there are many times where creating a dataset relies on existing archived data. To allow this, Icechunk supports "Virtual" chunks, where any number of chunks in a given dataset may reference external data in existing archival formats, such as netCDF, HDF, GRIB, or TIFF. Virtual chunks simply load the raw data from the source data's original location without copying, moving, or modifying the original data files. This allows for using Icechunk to manage large datasets from existing data without needing that data to be in `zarr` format. !!! warning @@ -14,7 +14,7 @@ To create virtual Icechunk datasets with python, the community utilizes the [ker ## Creating a virtual dataset with VirtualiZarr -We are going to create a virtual dataset with all of the [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) data for August 2024. This data is distributed publicly as netCDF files on AWS S3, with one netCDF file containing the Sea Surface Temperature (SST) data for each day of the month. We are going to use `VirtualiZarr` to combine all of these files into a single virtual dataset spanning the entire month, then write that dataset to Icechunk for use in analysis. +We are going to create a virtual dataset pointing to all of the [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) data for August 2024. This data is distributed publicly as netCDF files on AWS S3, with one netCDF file containing the Sea Surface Temperature (SST) data for each day of the month. We are going to use `VirtualiZarr` to combine all of these files into a single virtual dataset spanning the entire month, then write that dataset to Icechunk for use in analysis. !!! 
note From 1874a190d556f23b9366884b9a311e40cde9e360 Mon Sep 17 00:00:00 2001 From: Matthew Iannucci Date: Tue, 15 Oct 2024 10:12:54 -0400 Subject: [PATCH 09/12] Update docs/docs/icechunk-python/virtual.md Co-authored-by: Ryan Abernathey --- docs/docs/icechunk-python/virtual.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/docs/icechunk-python/virtual.md b/docs/docs/icechunk-python/virtual.md index c49df60f..301e5791 100644 --- a/docs/docs/icechunk-python/virtual.md +++ b/docs/docs/icechunk-python/virtual.md @@ -1,6 +1,6 @@ # Virtual Datasets -While Icechunk works wonderfully with native chunks managed by zarr, there are many times where creating a dataset relies on existing archived data. To allow this, Icechunk supports "Virtual" chunks, where any number of chunks in a given dataset may reference external data in existing archival formats, such as netCDF, HDF, GRIB, or TIFF. Virtual chunks simply load the raw data from the source data's original location without copying, moving, or modifying the original data files. This allows for using Icechunk to manage large datasets from existing data without needing that data to be in `zarr` format. +While Icechunk works wonderfully with native chunks managed by Zarr, there is lots of archival data out there in other formats already. To interoperate with such data, Icechunk supports "Virtual" chunks, where any number of chunks in a given dataset may reference external data in existing archival formats, such as netCDF, HDF, GRIB, or TIFF. Virtual chunks are loaded directly from the original source without copying or modifying the original archival data files. This enables Icechunk to manage large datasets from existing data without needing that data to be in Zarr format already. !!! 
warning From 09f9488f64f18d8d992bc6d654c35c8ef67f3d5e Mon Sep 17 00:00:00 2001 From: Matthew Iannucci Date: Tue, 15 Oct 2024 10:13:01 -0400 Subject: [PATCH 10/12] Update docs/docs/icechunk-python/virtual.md Co-authored-by: Ryan Abernathey --- docs/docs/icechunk-python/virtual.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/docs/icechunk-python/virtual.md b/docs/docs/icechunk-python/virtual.md index 301e5791..47570491 100644 --- a/docs/docs/icechunk-python/virtual.md +++ b/docs/docs/icechunk-python/virtual.md @@ -4,7 +4,8 @@ While Icechunk works wonderfully with native chunks managed by Zarr, there is lo !!! warning - While virtual references are fully supported in Icechunk, creating virtual datasets relies on using experimental or pre-release versions of open source tools. For full instructions on how to install the required tools and their current statuses [see the tracking issue on Github](https://github.com/earth-mover/icechunk/issues/197). + While virtual references are fully supported in Icechunk, creating virtual datasets currently relies on using experimental or pre-release versions of open source tools. For full instructions on how to install the required tools and their current statuses [see the tracking issue on Github](https://github.com/earth-mover/icechunk/issues/197). + With time, these experimental features will make their way into the released packages. To create virtual Icechunk datasets with python, the community utilizes the [kerchunk](https://fsspec.github.io/kerchunk/) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/) packages. 
From e864b137ae296681b9086540f1efbc941636b28f Mon Sep 17 00:00:00 2001 From: Matthew Iannucci Date: Tue, 15 Oct 2024 10:13:09 -0400 Subject: [PATCH 11/12] Update docs/docs/icechunk-python/virtual.md Co-authored-by: Ryan Abernathey --- docs/docs/icechunk-python/virtual.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/docs/icechunk-python/virtual.md b/docs/docs/icechunk-python/virtual.md index 47570491..2cb2ef34 100644 --- a/docs/docs/icechunk-python/virtual.md +++ b/docs/docs/icechunk-python/virtual.md @@ -7,7 +7,7 @@ While Icechunk works wonderfully with native chunks managed by Zarr, there is lo While virtual references are fully supported in Icechunk, creating virtual datasets currently relies on using experimental or pre-release versions of open source tools. For full instructions on how to install the required tools and their current statuses [see the tracking issue on Github](https://github.com/earth-mover/icechunk/issues/197). With time, these experimental features will make their way into the released packages. -To create virtual Icechunk datasets with python, the community utilizes the [kerchunk](https://fsspec.github.io/kerchunk/) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/) packages. +To create virtual Icechunk datasets with Python, the community utilizes the [kerchunk](https://fsspec.github.io/kerchunk/) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/) packages. `kerchunk` allows scanning the metadata of existing data files to extract virtual references. It also provides methods to combine these references into [larger virtual datasets](https://fsspec.github.io/kerchunk/tutorial.html#combine-multiple-kerchunked-datasets-into-a-single-logical-aggregate-dataset), which can be exported to its [reference format](https://fsspec.github.io/kerchunk/spec.html). 
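For context on the kerchunk reference format mentioned in the patch above: per the linked spec it is plain JSON, where inline strings carry Zarr metadata and `[url, offset, length]` triples point at byte ranges inside existing files. A minimal hand-written sketch — the file name, byte ranges, and array shape here are all invented:

```python
import json

# A minimal kerchunk-style reference set (version 1 of the spec).
# All concrete values below are made up for illustration.
refs = {
    "version": 1,
    "refs": {
        # Zarr group/array metadata stored inline as JSON strings
        ".zgroup": json.dumps({"zarr_format": 2}),
        "sst/.zarray": json.dumps({
            "shape": [720, 1440],
            "chunks": [720, 1440],
            "dtype": "<i2",
            "compressor": None,
            "fill_value": None,
            "filters": None,
            "order": "C",
            "zarr_format": 2,
        }),
        # chunk 0.0 lives at a byte range inside a remote netCDF file
        "sst/0.0": ["s3://some-bucket/oisst-day1.nc", 8192, 2073600],
    },
}

print(json.dumps(refs)[:60])
```

A reader that understands this format can serve the referenced bytes as if they were ordinary Zarr chunks, which is the same trick Icechunk's virtual chunks perform with their own internal representation.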
From b60d6a2caeee68fd49a134837422731c42447bbe Mon Sep 17 00:00:00 2001 From: Matthew Iannucci Date: Tue, 15 Oct 2024 10:13:17 -0400 Subject: [PATCH 12/12] Update docs/docs/icechunk-python/virtual.md Co-authored-by: Ryan Abernathey --- docs/docs/icechunk-python/virtual.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/docs/icechunk-python/virtual.md b/docs/docs/icechunk-python/virtual.md index 2cb2ef34..9abc2f18 100644 --- a/docs/docs/icechunk-python/virtual.md +++ b/docs/docs/icechunk-python/virtual.md @@ -19,7 +19,7 @@ We are going to create a virtual dataset pointing to all of the [OISST](https:// !!! note - At this point you should have followed the instructions [here](https://github.com/earth-mover/icechunk/issues/197) to install the necessary tools. + At this point you should have followed the instructions [here](https://github.com/earth-mover/icechunk/issues/197) to install the necessary experimental dependencies. Before we get started, we also need to install `fsspec` and `s3fs` for working with data on s3.
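Because the tutorial's `fs.glob` step needs network access, the same August 2024 key list can also be derived offline from the naming convention visible in the bucket layout earlier in the series. A small sketch, assuming that one-file-per-day convention holds for the whole month; in practice `fs.glob()` discovers the real keys:

```python
from datetime import date, timedelta

# Rebuild the August 2024 OISST object keys from the naming pattern
# used in the tutorial's glob: <base>/<YYYYMM>/oisst-avhrr-v02r01.<YYYYMMDD>.nc
base = "s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr"
start, n_days = date(2024, 8, 1), 31

oisst_files = [
    f"{base}/{d:%Y%m}/oisst-avhrr-v02r01.{d:%Y%m%d}.nc"
    for d in (start + timedelta(days=i) for i in range(n_days))
]

print(len(oisst_files))   # 31
print(oisst_files[0])
```

Generating the list this way also makes the expected dataset shape explicit up front: 31 daily files, hence the 31-step `time` dimension seen after concatenation.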