Data Access Patterns
A hydrofabric suitable for the NextGen Water Resources Framework is made available to the community as a cloud resource1.
In addition to the core hydrofabric products, a wide range of ancillary data can be associated with the fabrics to provide more opportunity for science, research, ML, and regionalization.
This wiki demonstrates how to find a subset of the CONUS dataset, and how to add data from (1) NextGen core models, (2) HydroATLAS, (3) EPA StreamCat, and (4) USGS basin characteristics.
Each requires slightly different steps, but they all share similar data access patterns.
To start, we need to load the associated library:
library(hydrofabric)
#> ── Attaching packages ────────────────────────────────────── hydrofabric 0.0.6 ──
#> ✔ dplyr 1.1.3 ✔ nhdplusTools 1.0.1
#> ✔ terra 1.7.55 ✔ hydrofab 0.5.0
#> ✔ ngen.hydrofab 0.0.3 ✔ zonal 0.0.2
#> ✔ climateR 0.3.1.4 ✔ glue 1.6.2
#> ✔ sf 1.0.14 ✔ arrow 13.0.0.1
#> ── Conflicts ──────────────────────────────────────── hydrofabric_conflicts() ──
#> ✖ arrow::buffer() masks terra::buffer()
#> ✖ terra::intersect() masks dplyr::intersect()
#> ✖ glue::trim() masks terra::trim()
#> ✖ terra::union() masks dplyr::union()
And define our
- desired outlet (in this case by hydrolocation),
- the location to write our subset network, and the
- s3 endpoint we will operate from:
# A hydrolocation URI
hl = 'Gages-04185000'
# The output directory
o = "data/gray_test.gpkg"
# s3 endpoint
s3_base = "s3://lynker-spatial/
NOTE: If you don't have, or lose, access to an s3 endpoint - or want faster repeated access - the `base_dir` option will allow all utilities here to work off of a local copy of the data.
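A minimal sketch of that local workflow (the directory path below is a placeholder for wherever the data has been copied):

# Sketch: run the subsetter against a local copy of the hydrofabric data
# (the path is a placeholder; base_dir is the option described above)
get_subset(hl_uri   = hl,
           outfile  = o,
           base_dir = "/path/to/local/lynker-spatial")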
Our first task is to build a subset from the desired outlet. While this example defines an outlet using a hydrolocation URI, you can also provide one of the following (an alternative call is sketched after the example below):
- comid,
- hl_uri,
- XY coordinate pair,
- nldi_feature,
- NextGen ID,
- geoconnex URL
# A hydrolocation URI
hl = 'Gages-04185000'
# The output directory
o = "data/gray_test.gpkg"
# Build subset
## writing all layers to "o"
get_subset(hl_uri = hl, outfile = o)
Created on 2023-10-17 by the reprex package (v2.0.1)
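As a reference, here is a sketch of the same subset requested through an NLDI feature rather than a hydrolocation URI (the `nldi_feature` argument name mirrors the list above and should be confirmed against `?get_subset`):

# Sketch: the same outlet expressed as an NLDI feature
# (argument name follows the identifier list above; confirm with ?get_subset)
get_subset(nldi_feature = list(featureSource = "nwissite",
                               featureID     = "USGS-04185000"),
           outfile = o)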
Now that we have a geopackage with the subset layers, it is possible to append data that has been summarized to the NextGen fabric, to the reference fabric, or to an associated data product.
In all cases the ancillary data is hosted in the cloud either by Lynker or the USGS. Access to this data is always most direct using the `network` layer of a hydrofabric. In the following example we show how to append the CFE/NOAH-OWP data to the subset divides:
- Access the hydrofabric network layer to identify the desired VPU
- Define an s3 file based on the VPU and attribute of choice
- Establish a connection to the remote file
- Filter to only include those divides in the subset network
- Pull down the data
- Join it to the divides and plot
library(hydrofabric)

## Read the subset divides (needed to filter the remote file)
div = read_sf(o, "divides")

par = open_dataset(glue('{s3_base}v20/conus_model_attributes.parquet')) %>%
  filter(divide_id %in% div$divide_id) %>%
  collect()

dim(par)
#> [1] 140  43

div = left_join(div, par, by = "divide_id")

plot(div['smcwlt_soil_layers_stag=2'])
Created on 2023-11-13 by the reprex package (v2.0.1)
HydroATLAS offers a global compendium of hydro-environmental characteristics for all sub-basins of HydroBASINS. We have indexed the reference features to the HydroATLAS/HydroBASINS divides and provide the attribute data and this crosswalk amongst the cloud resources.
Since HydroBASINS and the reference fabric stem from different source DEMs, there is no scalable relation between the two. Further, HydroATLAS is much coarser than the NextGen fabric, making a reference fabric centroid <--> HydroATLAS basin mapping about as good as it gets.
In the following example we append HydroATLAS data to the subset network:
- Define the variables of interest
- Access the hydrofabric network layer to identify a representative hf_id for each divide
- Define an s3 file based on the VPU and attribute of choice
- Establish a connection to the remote file
- Filter to only include those divides in the subset network, and select only the desired columns
- Pull down the data and join it to the network object (establishing a divide --> variable relation)
- Join it to the divides and plot
# Define HydroATLAS variables you want (can skip this and remove `select("hf_id", any_of(vars))` to get all)
vars = 'pet_mm_s01'
## READ NETWORK
net = read_sf(o, "network") %>%
select(divide_id, hf_id) %>%
filter(complete.cases(.)) %>%
group_by(divide_id) %>%
slice(1)
## EXTRACT HydroATLAS vars
ha = open_dataset(glue('{s3_base}tabular-resources/hydroATLAS/hydroatlas_vars.parquet')) %>%
filter(hf_id %in% net$hf_id) %>%
select("hf_id", any_of(vars)) %>%
collect() %>%
right_join(net, by = "hf_id")
## JOIN features and variables
div = left_join(read_sf(o, "divides"), ha, by = "divide_id")
## PLOT
plot(div[c(vars)])
Created on 2023-10-17 by the reprex package (v2.0.1)
StreamCat contains over 600 metrics that include local catchment (Cat), watershed (Ws), and special metrics. The special metrics were derived through modeling or by combining other StreamCat metrics. These variables include predicted water temperature, predicted biological condition, and the indexes of catchment and watershed integrity.
The catchment level is associated with the reference fabric used to build our products; therefore, an area-weighted scaling can be performed between the smaller (StreamCat) and larger (NextGen) basins using `zonal`.
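Conceptually, this rescaling is an area-weighted mean: each reference-catchment value is weighted by the area it contributes to the divide. `zonal::aggregate_zones` handles the bookkeeping, but a toy sketch of the weighting (with made-up values) looks like:

library(dplyr)

# Toy illustration of the area-weighted idea (values and areas are made up):
# two reference catchments (2 km² and 8 km²) fall within a single divide
ref_vals = data.frame(divide_id  = "cat-1",
                      value      = c(10, 30),  # reference-catchment values
                      s_areasqkm = c(2, 8))    # reference-catchment areas

ref_vals %>%
  group_by(divide_id) %>%
  summarize(value = weighted.mean(value, s_areasqkm))
# -> (10*2 + 30*8) / (2 + 8) = 26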
In the following example we append StreamCat STATSGO data to the subset network:
- List the StreamCat parquet files in the s3 account (138 in total) and determine those containing STATSGO data (n = 2)
- Access the hydrofabric network layer to identify the representative hf_ids and areas in each divide
- Establish a connection to the remote parquet file
- Filter to only include those reference features in the subset network, and select only the desired columns
- Pull down the data and join it to the network object (establishing a divide/area --> ref/area/value relation)
- Rescale the values using zonal::aggregate_zones
- Join it to the divides and plot
library(aws.s3)
sc_files = get_bucket_df(bucket = s3_base, prefix = 'tabular-resources/streamcats', region = "us-west-2") %>%
filter(!grepl("old|data", Key)) %>%
select(Key)
dim(sc_files)
#> [1] 138 1
(statsgo = filter(sc_files, grepl('STATSGO', Key)))
#> Key
#> 1 streamcats/streamcat_STATSGO_Set1_cat.parquet
#> 2 streamcats/streamcat_STATSGO_Set2_cat.parquet
ref = select(read_sf(o, "network"), hf_id, divide_id, areasqkm)
sc = open_dataset(glue('{s3_base}{statsgo$Key[1]}')) %>%
select(hf_id = COMID, ClayCat, SandCat, s_areasqkm = CatAreaSqKm) %>%
filter(hf_id %in% ref$hf_id) %>%
collect() %>%
left_join(ref, by = "hf_id")
sc = aggregate_zones(data = sc,
geom = NULL,
crosswalk = select(sc, hf_id, divide_id, areasqkm, s_areasqkm),
ID = "divide_id")
div = left_join(read_sf(o, "divides"), sc, by = "divide_id")
plot(div[c('ClayCat', 'SandCat')])
Created on 2023-10-17 by the reprex package (v2.0.1)
Mike Wieczorek and his team at the USGS regularly produce high quality summarizations of critical environmental data to the NHDPlus. A cached version of the 2021 release2 has been made available on ScienceBase.
We use the `nhdplusTools` index of these to find the proper URLs for a desired attribute (e.g. TWI). From that, a very similar rescaling process can take place. The exception here is that, unlike StreamCat, the base parquet files do not store the reference feature areas. To get around this, we need to add the reference feature areas to the network table using `nhdplusTools::get_vaa()`.
TODO: perhaps we should add `hf_area` to the `network` tables?
- Identify the USGS TWI parquet file
- Access the hydrofabric network layer to identify the representative hf_ids and area of each divide.
- Join this to a table of reference feature areas.
- Establish a connection to the remote parquet file
- Filter to only include those reference features in the subset network, and select only the desired columns
- Pull down the data and join it to the network object (establishing a divide/area --> ref/area/value relation)
- Rescale the values using zonal::aggregate_zones
- Join it to the divides and plot
usgs_c = get_characteristics_metadata() %>%
filter(ID == 'TOT_TWI')
ref = select(read_sf(o, "network"), hf_id, divide_id, areasqkm) %>%
left_join(rename(get_vaa('areasqkm'), s_areasqkm = areasqkm), by = c('hf_id' = 'comid')) %>%
filter(complete.cases(.))
usgs_c = open_dataset(usgs_c$s3_url) %>%
select(hf_id = COMID, TOT_TWI) %>%
filter(hf_id %in% ref$hf_id) %>%
collect() %>%
left_join(ref, by = "hf_id")
usgs_c = aggregate_zones(data = usgs_c,
geom = NULL,
crosswalk = select(usgs_c, hf_id, divide_id, areasqkm, s_areasqkm),
ID = "divide_id")
div = left_join(read_sf(o, "divides"), usgs_c, by = "divide_id")
plot(div["TOT_TWI"])
Created on 2023-10-17 by the reprex package (v2.0.1)
Lastly, there are times when gridded data needs to be accessed. In a simple example, we have reformatted the data from the NWM 3.0 needed to run key NextGen modules. This has been saved in the `gridded-resources` directory of our s3 account.
This data can be downloaded and accessed for the divides in the subset network:
# Layers of interest (names match the output below)
vars = c("slope", "ISLTYP", "IVGTYP")

r = rast(glue('{s3_base}gridded-resources/nwm/conus/{vars}.tif'))
d = crop(r, project(vect(div), crs(r)), snap = "out", mask = TRUE)
plot(d)

r2 = correct_nwm_spatial(glue('{s3_base}ngen_gridded_data.nc'), lyrs = vars)
crop(r2, project(vect(div), crs(r2)), snap = "out")
#> class : SpatRaster
#> dimensions : 60, 31, 3 (nrow, ncol, nlyr)
#> resolution : 1000, 1000 (x, y)
#> extent : 998000.4, 1029000, 238999.3, 298999.3 (xmin, xmax, ymin, ymax)
#> coord. ref. : Sphere_Lambert_Conformal_Conic
#> source(s) : memory
#> names : slope, ISLTYP, IVGTYP
#> min values : 0.007807857, 3, 1
#> max values : 0.297914475, 14, 16
Created on 2023-11-13 by the reprex package (v2.0.1)
climateR-catalogs is a monthly updating archive of web-accessible data that currently catalogs over 100,000 datasets from over 2,000 providers.
This catalog can be accessed in JSON or parquet format and is used not only for NextGen activities, but by the USGS for a range of activities (e.g. gdptools).
The catalog is shipped with `climateR`, which provides generalized data access patterns for geospatial data (on the web and local).
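If you would rather work with the catalog outside of climateR, the parquet artifact can be read directly with arrow; a sketch (the release URL below is an assumption, check the climateR-catalogs repository for the current asset):

# Sketch: read the parquet form of the catalog directly
# (release URL is an assumption; see the climateR-catalogs repository)
library(arrow)
cat_url = "https://github.com/mikejohnson51/climateR-catalogs/releases/latest/download/catalog.parquet"
tmp     = tempfile(fileext = ".parquet")
download.file(cat_url, tmp, mode = "wb")
catalog_remote = read_parquet(tmp)
dim(catalog_remote)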
For this example, let's say we are interested in the mean sand percentage in our divides across the 6 depth layers provided by POLARIS.
First we would need to reduce the catalog:
cat = filter(catalog, id == "polaris", grepl('mean sand', varname))
#> # A tibble: 6 × 29
#> id asset URL type varname variable description units model ensemble
#> <fct> <chr> <chr> <fct> <chr> <fct> <chr> <chr> <chr> <chr>
#> 1 polaris <NA> /vsicur… vrt mean s… mean sa… sand perce… % <NA> <NA>
#> 2 polaris <NA> /vsicur… vrt mean s… mean sa… sand perce… % <NA> <NA>
#> 3 polaris <NA> /vsicur… vrt mean s… mean sa… sand perce… % <NA> <NA>
#> 4 polaris <NA> /vsicur… vrt mean s… mean sa… sand perce… % <NA> <NA>
#> 5 polaris <NA> /vsicur… vrt mean s… mean sa… sand perce… % <NA> <NA>
#> 6 polaris <NA> /vsicur… vrt mean s… mean sa… sand perce… % <NA> <NA>
#> # ℹ 19 more variables: scenario <chr>, T_name <chr>, duration <chr>,
#> # interval <chr>, nT <int>, X_name <chr>, Y_name <chr>, X1 <dbl>, Xn <dbl>,
#> # Y1 <dbl>, Yn <dbl>, resX <dbl>, resY <dbl>, ncols <int>, nrows <int>,
#> # crs <chr>, toptobottom <lgl>, tiled <fct>, dim_order <fct>
Then, access the gridded data for these resources and our AOI:
div = read_sf(o, "divides")
sand_cat = dap(catalog = cat, AOI = div)
names(sand_cat) = paste0('layer', 1:6)
#> class : SpatRaster
#> dimensions : 1978, 1321, 6 (nrow, ncol, nlyr)
#> resolution : 0.0002777778, 0.0002777778 (x, y)
#> extent : -84.51528, -84.14833, 41.48556, 42.035 (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326)
#> source(s) : memory
#> names : layer1, layer2, layer3, layer4, layer5, layer6
#> min values : 5.791708, 5.620117, 4.984344, 3.536133, 2.602639, 3.738049
#> max values : 88.154297, 88.329582, 90.301567, 92.887917, 94.666504, 95.716583
And finally, determine the per-divide mean using `zonal`:
sand = execute_zonal(data = sand_cat, geom = div, ID = "id")
Created on 2023-10-24 by the reprex package (v2.0.1)
In addition to the general `dap` utilities provided by `climateR`, the package provides a suite of "shortcuts" for common datasets. For example, we can use climateR to (1) extract, and (2) process a 65-year, monthly data product in a matter of seconds:
div = read_sf(o, "divides")
## 65 years of monthly 4km data
system.time({
tmax = getTerraClim(AOI = div, varname = "tmax")
})
#> user system elapsed
#> 0.524 0.051 20.557
#> $tmax
#> class : SpatRaster
#> dimensions : 14, 10, 780 (nrow, ncol, nlyr)
#> resolution : 0.04166667, 0.04166667 (x, y)
#> extent : -84.54167, -84.125, 41.45833, 42.04167 (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=longlat +ellps=WGS84 +no_defs
#> source(s) : memory
#> names : tmax_~total, tmax_~total, tmax_~total, tmax_~total, tmax_~total, tmax_~total, ...
#> min values : -0.76, -2.77, 5.05, 14.88, 21.57, 22.97, ...
#> max values : 0.04, -1.90, 5.75, 15.91, 22.58, 23.94, ...
#> unit : degC, degC, degC, degC, degC, degC, ...
#> time : 1958-01-01 to 2022-12-01 UTC
Once accessed, the data can be summarized using `zonal` for the divides of interest:
system.time({
z = execute_zonal(tmax, geom = div, ID = "id", fun = "mean")
})
#> user system elapsed
#> 0.462 0.024 0.491
Let's say that instead of a monthly mean value over each month in the 65-year record, you wanted the mean monthly normal (e.g. mean January temperature). Since the data is in memory, this modification is quite simple:
system.time({
# Define an index
ind = rep(1:12, times = nlyr(tmax[[1]])/12)
# Summarize over index
x <- rast(lapply(1:12, function(i) mean(tmax[[1]][[which(ind == i)]])))
names(x) = month.abb
# summarize
z.mean = execute_zonal(x, geom = div, ID = "id", fun = "mean")
z.max = execute_zonal(x, geom = div, ID = "id", fun = "max")
})
#> user system elapsed
#> 0.285 0.007 0.296
Created on 2023-10-24 by the reprex package (v2.0.1)
In many cases, DEM data is needed. We store many variations of DEMs in the climateR-catalog, including 3DEP, NASADEM, GEBCO, and some OpenTopography sets.
Here, we can quickly exemplify accessing a 3DEP DEM for our divides:
dem = climateR::get3DEP(div)
Created on 2023-10-24 by the reprex package (v2.0.1)
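The other DEM sources mentioned above live in the same catalog, so they can be pulled with the generic `dap()` pattern as well; a sketch (the "nasadem" label is an assumption, inspect `unique(catalog$id)` for the exact id):

# Sketch: pull another catalogued DEM (e.g. NASADEM) for the same divides
# (the id string is an assumption; check unique(catalog$id) for the exact label)
nasadem_cat = filter(catalog, grepl("nasadem", id, ignore.case = TRUE))
nasadem     = dap(catalog = nasadem_cat, AOI = div)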
Now while we cover a huge amount of resources in the catalog, there are always datasets we don't. Fortunately, the `dap` utilities can read from remote, non-cataloged resources as well. For example, we can point to an OPeNDAP endpoint:
system.time({
file <- dap(
URL = "https://cida.usgs.gov/thredds/dodsC/bcsd_obs",
AOI = div,
startDate = "1995-01-01",
verbose = FALSE,
varname = "pr"
) |>
execute_zonal(geom = div, fun = "mean", ID = "id", join = TRUE)
})
#> user system elapsed
#> 0.444 0.049 2.733
Created on 2023-10-24 by the reprex package (v2.0.1)
1 Johnson, J. M. (2022). National Hydrologic Geospatial Fabric (hydrofabric) for the Next Generation (NextGen) Hydrologic Modeling Framework, HydroShare, http://www.hydroshare.org/resource/129787b468aa4d55ace7b124ed27dbde
2 Wieczorek, M.E., Jackson, S.E., and Schwarz, G.E., 2018, Select Attributes for NHDPlus Version 2.1 Reach Catchments and Modified Network Routed Upstream Watersheds for the Conterminous United States (ver. 3.0, January 2021): U.S. Geological Survey data release, doi:10.5066/F7765D7V