
Convert and publish GPM IMERG dataset to COG #73

Closed
4 tasks
abarciauskas-bgse opened this issue May 24, 2022 · 15 comments

@abarciauskas-bgse
Contributor

Epic

None, but to support the ArcGIS Enterprise in the Cloud Effort

Description

Convert the half-hour product to COG for use by ADC initiative

Background

Brian Tisdale who is leading the ArcGIS Enterprise in the Cloud effort reached out on slack:

The newly formed ArcGIS Enterprise in the Cloud team is starting to get their footing and ready to dive into the details of how the GIS component of VEDA will be integrated. I know we have the larger stakeholder sync next week but hoping we can coordinate on a few questions prior. As GPM is a priority dataset for both VEDA and ArcGIS Enterprise, we'd like to propose focusing on it for initial prototyping to inform the cross-team decisions that will need to be made. Provided below is a link to our GPM ArcGIS Image Service API. It's currently hosted at Goddard but will be migrated to the Earthdata Cloud as part of the ArcGIS Enterprise in the Cloud activity. Most of our initial questions are based on how the COG generation is going to occur. Do you know if VEDA or EIS has started COG generation for GPM?
GPM ArcGIS Image Service API: https://arcgis.gesdisc.eosdis.nasa.gov/authoritative/rest/services/GPM_3IMERGHHE_06/ImageServer

I sent Brian an email message:
If I understand correctly, to support the ADC (or is it a different acronym now, "ArcGIS Enterprise in the Cloud"?) we want to:

  1. Ingest and publish GPM 3IMERGHHE 06 data into the VEDA metadata API (STAC)
  2. Create Cloud-Optimized GeoTIFFs to support the services described below
  3. Publish services for visualization: I see there is a WMS service - would WMTS be OK to support ArcGIS in the Cloud, or must it be WMS?
  4. Publish services for access: we will need WCS support for ArcGIS Enterprise in the Cloud

GPM IMERG is a high-value first example of executing the above steps, but there will be many other datasets that follow a similar model.

Acceptance Criteria:

  • GPM IMERG in COG shared with Brian Tisdale for inspection
  • Publish a few files to dev STAC API and share API for dynamic visualization and testing in ArcGIS Enterprise interface
  • Document lessons learned and best practices for COG conversion and publishing to support clients like ArcGIS. What did we learn about the required conversion scripts and the challenges that should inform the final API design, UI design, and implementation?
  • Demo any API and UI developed in this effort
@sharkinsspatial

@abarciauskas-bgse Out of curiosity, what are the plans for the COG layout of the IMERG variables? Will you create multi-band COGs, or a host of single-band COGs with variable naming conventions? If there is a consideration for generating COGs for large numbers of netCDF files, it might be worthwhile to consult the user community, since we'll be diverging from the commonly accepted CF Conventions (https://cfconventions.org/) that most scientific producers and consumers try to adhere to. For a reference example of working with the IMERG data, here is the recipe we developed for pangeo-forge: https://github.com/pangeo-forge/gpm-imerge-hhr-feedstock/blob/main/feedstock/recipe.py
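To make the two layouts concrete, here is a minimal Python sketch of the files each option would produce for one granule. The variable names come from the IMERG Grid group; the output naming convention itself is hypothetical, not something VEDA or pangeo-forge has settled on.

```python
# Sketch of the two candidate COG layouts, expressed as the files each
# would produce for one IMERG half-hourly granule.
granule = "3B-HHR-E.MS.MRG.3IMERG.20220101-S000000-E002959.0000.V06B"
variables = ["precipitationCal", "precipitationUncal", "randomError"]

# Option A: one multi-band COG per granule; the band order carries the
# variable mapping (band 1 = precipitationCal, ...), which is implicit
# and easy to lose without side-channel metadata.
multiband = {f"{granule}.tif": {i + 1: v for i, v in enumerate(variables)}}

# Option B: one single-band COG per variable; the variable name is
# explicit in the filename, at the cost of many more objects to manage.
single_band = {f"{granule}_{v}.tif": v for v in variables}
```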

Another consideration is the update strategy. We are still working out our incremental-append strategy for pangeo-forge, but we should have something well defined in the next few sprints. This question was also raised recently in relation to the IMERG data: pangeo-forge/gpm-imerge-hhr-feedstock#2

@abarciauskas-bgse
Contributor Author

@sharkinsspatial these are all good questions.

  • For IMERG, I think @ingalls is starting by creating an API and UI so it is easy to modify the configuration for how variables are selected and named. @ingalls, have you considered how to specify things like which variables correspond to which bands, in one-to-many files, and the option for variable-based file naming? I'm assuming that if one wishes to store a different variable in each output COG, one would configure the generation to name each output file with a substring that includes the band/variable name.

  • I need to read up on CF conventions, so I will have to get back to you on how we can adhere to them for IMERG and future collections.
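As a sketch of what such a configuration might look like - every key below is hypothetical and none of it comes from raster-uploader - mapping variables to output COGs and deriving the file names could be as simple as:

```python
# Hypothetical configuration for variable selection and output naming.
# The keys and structure are illustrative only.
CONFIG = {
    "collection": "GPM_3IMERGHHE.06",
    "outputs": [
        {"variables": ["precipitationCal"], "suffix": "precipitationCal"},
        {"variables": ["randomError"], "suffix": "randomError"},
    ],
}

def output_filename(source_stem: str, output: dict) -> str:
    """Name each output COG with a substring that includes the variable."""
    return f"{source_stem}_{output['suffix']}.tif"

names = [
    output_filename(
        "3B-HHR-E.MS.MRG.3IMERG.20220101-S000000-E002959.0000.V06B", o
    )
    for o in CONFIG["outputs"]
]
```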

In general, I want to centralize questions and answers about generating cloud-optimized (analysis-ready?) data. So far @wildintellect has helped start these documents:

I would be interested to know what you think, @sharkinsspatial, about the layout and content so far in those documents. I know there are a lot of resources on COG and Zarr out there, but the intention with these documents is to have somewhere to point our stakeholders when they are looking for guidance on creating COGs or Zarr.

@ingalls

ingalls commented Jun 14, 2022

@abarciauskas-bgse Current codebase is here as we sketch this out: https://github.com/developmentseed/raster-uploader/

  • Infra for hosting API & database has been deployed
  • Support for uploading user supplied rasters => s3 deployed
  • Full upload management API has been deployed
  • Upload "Steps" have been sketched out, which will give users the ability to dynamically alter rasters in an interactive way throughout the import process

Current API Location: raster-uploader-prod-1759918000.us-east-1.elb.amazonaws.com
Username: default
Password: [DM Me]

[Two screenshots of the raster-uploader interface, 2022-06-14]

@abarciauskas-bgse
Contributor Author

@ingalls got this working today (I believe - still looking at the result and making sure it looks correct):

https://github.com/NASA-IMPACT/cloud-optimized-data-pipelines/tree/ab/updates-for-imerg/docker/hdf5-to-cog#gpm-imerg-example

I will generate a few samples tomorrow to send to the ADC team.
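For context on what a conversion like the linked hdf5-to-cog script has to handle: the IMERG HDF5 arrays are stored as (time, lon, lat), so producing a north-up GeoTIFF requires a transpose, a vertical flip, and an explicit geotransform. A minimal numpy sketch, assuming the standard 0.1-degree global IMERG grid (these numbers are general IMERG facts, not read from the linked repo):

```python
import numpy as np

# IMERG global grid: 0.1 degree, 3600 columns (lon) x 1800 rows (lat).
ncols, nrows, res = 3600, 1800, 0.1

# Stand-in for a variable like precipitationCal, shaped (time, lon, lat).
data = np.zeros((1, ncols, nrows), dtype="float32")

# Transpose to (lat, lon), then flip so row 0 is at 90N (north-up).
band = np.flipud(data[0].T)

# GDAL-style geotransform for the north-up array:
# (west, xres, 0, north, 0, -yres)
geotransform = (-180.0, res, 0.0, 90.0, 0.0, -res)
```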

@abarciauskas-bgse
Contributor Author

@ingalls can you share the IMERG COG output you generated with raster-uploader, along with what was the source NetCDF and the config you used to generate it? I want to compare it with the one I produced and previously shared with the ADC team.

@ingalls

ingalls commented Jul 27, 2022

@abarciauskas-bgse The general directory can be found here:

```shell
aws s3 ls s3://raster-uploader-prod-853558080719-us-east-1/uploads/67/
```

The input file exists here:

```shell
aws s3 ls s3://raster-uploader-prod-853558080719-us-east-1/uploads/67/imerg_test.nc
```

And the precipitationCal output exists here:

```shell
aws s3 ls s3://raster-uploader-prod-853558080719-us-east-1/uploads/67/step/77/final.tif
```

I just grabbed a random IMERG dataset to use for testing. I would be happy to get some time on the calendar and run through your process vs. mine with the same input file. Alternatively, I'm happy to do it async if you can provide the input file you used, so we can make sure we have parity.

@abarciauskas-bgse
Contributor Author

I'm probably going to try this myself, but did you generate this before or after you added the flipping option? When I compare it to the sample I created, it looks like one is flipped and one is not - but that could depend on the source.

Comparing the one I generated:

[Screenshot of the COG preview, 2022-07-28]

https://ejd872yh78.execute-api.us-east-1.amazonaws.com/cog/preview?url=s3%3A%2F%2Fveda-data-store-staging%2FGPM_3IMERGHHE.06%2F3B-HHR-E.MS.MRG.3IMERG.20220101-S000000-E002959.0000.V06B.HDF5.tif&unscale=false&resampling=nearest&rescale=0%2C10&colormap_name=blues_r&return_mask=true

with the one linked above: (locally using rio viz)

[Screenshot of the rio viz preview, 2022-07-28]

For reference, I think the file you generated was from https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHHE.06/2022/167/3B-HHR-E.MS.MRG.3IMERG.20220616-S000000-E002959.0000.V06C.HDF5, based on gdalinfo'ing the netCDF file.
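A quick way to confirm the flip hypothesis, once both rasters are read into arrays - a hypothetical numpy check, not the comparison actually used here, and it only detects a pure north/south flip, not a different source granule or resampling:

```python
import numpy as np

def is_flipped_copy(a: np.ndarray, b: np.ndarray) -> bool:
    """True if b is a vertically flipped copy of a (NaNs treated as equal)."""
    return a.shape == b.shape and np.allclose(a, np.flipud(b), equal_nan=True)

# Tiny synthetic example standing in for two COG bands:
a = np.arange(12, dtype=float).reshape(3, 4)
print(is_flipped_copy(a, np.flipud(a)))  # True
print(is_flipped_copy(a, a))             # False
```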

@abarciauskas-bgse
Contributor Author

Just also noting some of the conversation from email and slack:

  • Nick is working on discovery + fan out of the GPM IMERG files, which currently relies on discovery from the HTTPS directory and requesting files via EDL
  • We are asking Owen Kelly and George Huffman via email about getting the files another way. George mentioned "in bulk", so if there is no way to discover + fan out directly from their servers, perhaps we could bulk-download the HDFs to an S3 "landing zone", discover + fan out COG generation from the S3 location, and then remove the HDF files.
    cc @ingalls

@wildintellect
Contributor

There's another HTTPS way to access IMERG that does not use EDL, which we used in the Pangeo-Forge recipe (@sharkinsspatial and I wrote it).
Also, the naming pattern is very well known - there is no need for discovery once you know the date range and product you want.
https://github.com/pangeo-forge/staged-recipes/blob/b3f80f1e23ff9df1a1cf9622a7d7fa9107305754/recipes/gpm-imerg/recipe.py#L11-L26

I believe this access method might allow for fsspec (or s3fs) access to the files without pre-download.

cc: @abarciauskas-bgse @ingalls @sharkinsspatial
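Since the naming pattern is deterministic, the half-hourly granule names can be generated directly from a date range. A sketch, assuming the pattern seen in the granule names earlier in this thread (the version suffix varies in practice, e.g. V06B vs. V06C):

```python
from datetime import datetime, timedelta

def imerg_half_hourly_names(day: datetime, version: str = "V06B") -> list:
    """Generate the 48 half-hourly IMERG Early granule names for one day.

    Pattern inferred from granule names in this thread; the version
    suffix is an assumption and varies across the record.
    """
    names = []
    for slot in range(48):
        start = day + timedelta(minutes=30 * slot)
        end = start + timedelta(minutes=29, seconds=59)
        minutes_of_day = 30 * slot  # 4-digit minutes-since-midnight field
        names.append(
            "3B-HHR-E.MS.MRG.3IMERG."
            f"{start:%Y%m%d}-S{start:%H%M%S}-E{end:%H%M%S}."
            f"{minutes_of_day:04d}.{version}.HDF5"
        )
    return names

names = imerg_half_hourly_names(datetime(2022, 1, 1))
# First granule of the day:
# 3B-HHR-E.MS.MRG.3IMERG.20220101-S000000-E002959.0000.V06B.HDF5
```

Each generated name can then be appended to the product's directory URL and opened with fsspec, as suggested above.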

@wildintellect
Contributor

Here's the bulk access instructions https://gpm.nasa.gov/sites/default/files/2021-01/arthurhouhttps_retrieval.pdf

@abarciauskas-bgse abarciauskas-bgse self-assigned this Aug 12, 2022
@abarciauskas-bgse
Contributor Author

abarciauskas-bgse commented Aug 12, 2022

I picked this up again and started deploying and testing it, and everything is going smoothly - kudos @slesaad @xhagrg for the veda-data-pipelines refactor. Work is in https://github.com/NASA-IMPACT/veda-data-pipelines/tree/ab/deploy-for-imerg

Work to go:

  • Get Owen and George to 👍🏽 the COG (In-progress)
  • Get Owen and George to 👍🏽 the metadata
  • Merge updated main (which has @alukach's updated STAC registration code) and test that the workflow still works
  • Publish a few granules to develop
  • Check with ADC team that the location and metadata look reasonable to them (do they need the metadata or just the files?)
  • Test on a larger subset and verify we can scale to total number of granules
  • Publish all granules to staging (389,168)
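As a rough sanity check on the 389,168 figure: the half-hourly record yields 48 granules per day, and assuming the record begins 2000-06-01 (the IMERG V06 record start - an assumption here, not stated in this thread), the count as of August 2022 lands in the right ballpark:

```python
from datetime import date

# 48 half-hourly granules per day, record assumed to start 2000-06-01.
days = (date(2022, 8, 12) - date(2000, 6, 1)).days
approx_granules = days * 48
print(approx_granules)  # roughly 389,000 - consistent with 389,168
```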

@smohiudd
Contributor

smohiudd commented Dec 7, 2022

I uploaded around 50 COG samples to s3://climatedashboard-data/GPM_3IMERGHHE/

@abarciauskas-bgse can we send this to Owen and George for review?

@abarciauskas-bgse
Contributor Author

Thanks @smohiudd - sorry if this wasn't clear, but we should put them in s3://veda-data-store-staging before sending them to Owen and George, so they can confirm they can access the files in an "official" staging bucket (though they should eventually live in s3://veda-data-store).

@j08lue
Contributor

j08lue commented Mar 1, 2023

The GPM IMERG data is also available as Zarr - it does not help us for viz, but it is relevant to include in our catalog anyway.

@gadomski gadomski transferred this issue from NASA-IMPACT/veda-data-pipelines Sep 22, 2023
@j08lue
Contributor

j08lue commented Sep 29, 2023

Stale

@j08lue j08lue closed this as not planned (stale) Sep 29, 2023