ICESat-2 ML tutorial - photon classification on ATL07 sea ice data #17

Merged: weiji14 merged 21 commits into main from photon_classifier on Aug 18, 2024

Member

@weiji14 weiji14 commented Aug 6, 2024

Draft tutorial for doing ICESat-2 ATL07 photon classification into 3 surface types (open water, thin ice, thick/snow-covered ice). This is a reimplementation of @YoungHyunKoo's code at https://github.com/YoungHyunKoo/IS2_ML.

Preview at https://deploy-preview-17--icesat2-website2024.netlify.app/tutorials/machine-learning/photon_classifier

[Image: ATL07_point_cloud_classifier]

Excalidraw file: ATL07_point_cloud_classifier.excalidraw.tar.gz

TODO:

  • Initial layout with learning objectives and sections
  • Pre-processing code to read ATL07 sea ice data (6 data variables) from HDF5 to a geopandas.GeoDataFrame
  • Find coincident Sentinel-2 image to extract surface reflectance values
  • Show how to move data from disk/CPU to GPU
  • Architect the Machine Learning model
  • Show training results

Xref: uwhackweek/schedule-2024#38

References:

  • Koo, Y., Xie, H., Kurtz, N. T., Ackley, S. F., & Wang, W. (2023). Sea ice surface type classification of ICESat-2 ATL07 data by using data-driven machine learning model: Ross Sea, Antarctic as an example. Remote Sensing of Environment, 296, 113726. https://doi.org/10.1016/j.rse.2023.113726

First draft with a rough layout of sections for the ICESat-2 ML photon classification tutorial. Included learning objectives, and some initial code to read ATL07 sea ice data from HDF5 to a geopandas.GeoDataFrame. Reimplementing the Koo et al., 2023 paper, whose code is at https://github.com/YoungHyunKoo/IS2_ML.
@weiji14 weiji14 self-assigned this Aug 6, 2024
@weiji14 weiji14 added the preview Create a website preview label Aug 6, 2024
Contributor

github-actions bot commented Aug 6, 2024

Show how to save geopandas.GeoDataFrame to a GeoParquet file, and load it back again. Also put down some notes about compression codecs.
Some quick code to convert the geopandas.GeoDataFrame to a torch.Tensor and put it in a torch DataLoader. Showing how to move data from CPU to GPU using the `.to` method. Might modify this section's title/subtitle later depending on how the code goes.
One more entry in the tutorial index page. Putting down Machine Learning and Pytorch as the topics, and ATL07 as the dataset used for now.
Less boilerplate s3fs code to manage, and not using icepyx means this should run on the Pangeo pytorch-notebook docker image too!
Comment on lines 53 to 55
# Authenticate using NASA EarthData login
auth = earthaccess.login()
s3 = earthaccess.get_s3fs_session(daac="NSIDC") # Start an AWS S3 session
Member Author

@weiji14 weiji14 Aug 8, 2024

Continuing from #17 (comment), this is the current error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[2], line 3
      1 # Authenticate using NASA EarthData login
      2 auth = earthaccess.login()
----> 3 s3 = earthaccess.get_s3fs_session(daac="NSIDC")  # Start an AWS S3 session

File ~/micromamba/envs/hackweek/lib/python3.11/site-packages/earthaccess/api.py:352, in get_s3fs_session(daac, provider, results)
    350         session = earthaccess.__store__.get_s3fs_session(endpoint=endpoint)
    351         return session
--> 352 session = earthaccess.__store__.get_s3fs_session(daac=daac, provider=provider)
    353 return session

AttributeError: 'NoneType' object has no attribute 'get_s3fs_session'

Not sure if there's a way to pass auth credentials to GitHub Actions and/or Netlify build so that this line works?

Member

Can definitely add secrets for EARTHDATA_USERNAME and EARTHDATA_PASSWORD, which would enable Earthdata login during Actions. However, GitHub Actions runs on Azure, so S3 access isn't available. I think we'd have to build a self-hosted runner that deploys on AWS.

Member

Ideally we could execute these notebooks on a remotely booted CryoCloud user server. I think all the pieces exist to do this nowadays? @yuvipanda I saw this relevant issue NASA-IMPACT/veda-jupyterhub#46 as I was exploring your latest awesome 2i2c tech including https://github.com/yuvipanda/jupyter-sshd-proxy. Is there a straightforward way to start a user server via CI? Then you'd just have to fire up an ssh connection, execute a notebook, copy and commit the rendered version.

Member Author

Another option is if CryoCloud had a BinderHub, we could follow Project Pythia and use execute_notebooks: binder in _config.yml (Example here, xref https://discourse.pangeo.io/t/statement-of-need-integrating-jupyterbook-and-jupyterhubs-via-ci/2705/14). The ssh option would be pretty cool to get working though!

Member

@weiji14 For now I recommend executing the notebook on CryoCloud, saving with outputs and adding an entry to not execute it in CI here:

exclude_patterns:
- "**/geospatial-advanced.ipynb"

Looking for a coincident alignment of two satellites (ICESat-2 and Sentinel-2) capturing data at the same time! Managed to find a coincident capture on 2019-02-24, though haven't checked if the spatial extent matches yet. Can improve the search algorithm later by expanding the search time window (+/- X minutes) and using a more exact bounding box search in the STAC API query.
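The "+/- X minutes" time window mentioned above could be sketched roughly as follows. This is a minimal illustration, not the tutorial's search code: the function name `find_coincident`, the 10-minute default and the acquisition times are all made up for the example.

```python
from datetime import datetime, timedelta, timezone


def find_coincident(is2_times, s2_times, window_minutes=10):
    """Return (is2_time, s2_time) pairs acquired within +/- window_minutes."""
    window = timedelta(minutes=window_minutes)
    pairs = []
    for t_is2 in is2_times:
        for t_s2 in s2_times:
            if abs(t_is2 - t_s2) <= window:
                pairs.append((t_is2, t_s2))
    return pairs


# Made-up acquisition times, for illustration only
is2_times = [datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)]
s2_times = [
    datetime(2020, 1, 1, 12, 7, tzinfo=timezone.utc),
    datetime(2020, 1, 1, 15, 30, tzinfo=timezone.utc),
]
pairs = find_coincident(is2_times, s2_times, window_minutes=10)
```

Widening `window_minutes` trades temporal accuracy (sea ice drifts between captures) for more candidate matches.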
Temporal match wasn't enough, so adding the spatial match as well. Metadata on ICESat-2 was lacking unfortunately, so need to open the ATL07 HDF5 file to get the xy coordinates and build a linestring from it to pass to the STAC query. Managed to find a lucky coincident match on 2019-10-31, and have verified that the crossover is valid.
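As a rough sketch of the linestring step, the track coordinates can be thinned and packed into a GeoJSON LineString, which is the kind of geometry a STAC API item search accepts in its `intersects` parameter. The helper name `track_to_linestring` and the `step` value are hypothetical, and the sketch assumes the coordinates are already longitude/latitude in WGS84 (projected x/y values would need reprojecting first).

```python
def track_to_linestring(lons, lats, step=100):
    """Build a GeoJSON LineString from every `step`-th point of a track."""
    if len(lons) != len(lats) or len(lons) < 2:
        raise ValueError("need two or more matching lon/lat values")
    coords = [[float(x), float(y)] for x, y in zip(lons[::step], lats[::step])]
    # Always keep the final point so the line spans the full track
    last = [float(lons[-1]), float(lats[-1])]
    if coords[-1] != last:
        coords.append(last)
    return {"type": "LineString", "coordinates": coords}


# Synthetic track of 1000 points, for illustration only
lons = [-170.0 + 0.001 * i for i in range(1000)]
lats = [-75.0 + 0.002 * i for i in range(1000)]
geometry = track_to_linestring(lons, lats, step=100)
```

Thinning the track keeps the query payload small; an ATL07 beam has far more points than a search geometry needs.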
Add the `x_atc`, `layer_flag` and `height_segment_ssh_flag` data variables to the GeoDataFrame which will be useful for plotting/filtering later. Using `height_segment_ssh_flag` to remove points that might be affected by clouds.
Get the Sentinel-2 RGB image, reproject the ATL07 points and subset to the image's bounding box, then plot them both using PyGMT! The plot colors sea ice points as blue, and sea surface (water) points as orange.
Use PyGMT's grdtrack to get the Sentinel-2 Red band's pixel values sampled at every ATL07 xy point, and then apply a simple threshold to classify into water (dark), thin ice (gray) and thick ice (white).
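The thresholding idea can be sketched in a few lines of plain Python. The function name and both threshold values below are placeholders chosen for illustration; they are not the values used in the tutorial, and real thresholds depend on the scaling of the Sentinel-2 Red band.

```python
def classify_red_band(value, dark_max, gray_max):
    """Map one Red band pixel value to a surface type using two thresholds."""
    if value != value:  # NaN, e.g. a point outside the image
        return None
    if value < dark_max:
        return "water"  # dark pixels
    if value < gray_max:
        return "thin_ice"  # gray pixels
    return "thick_ice"  # white pixels


# Placeholder thresholds and pixel values, for illustration only
red_values = [300.0, 4000.0, 9000.0, float("nan")]
labels = [classify_red_band(v, dark_max=1000.0, gray_max=6000.0) for v in red_values]
```

On a DataFrame column the same logic could be applied in a vectorized way, but the per-value form makes the class boundaries explicit.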
@weiji14 weiji14 mentioned this pull request Aug 13, 2024
Reorganizing some content so Part 2 is focused on preparing the DataLoader and neural network model architecture. Have now moved the dataloader for-loop to Part 3 'Training' and commented out the parts that move data to CUDA. Also calculated the "hist_mean_median_h_diff" column, which is the actual variable we want to use in training.
Writing up section about choosing a machine learning algorithm, including ML models with different levels of complexity from decision trees to neural networks and state-of-the-art models. Also implemented a simple multi-layer perceptron model based on the description in Koo et al., 2023's paper (but without the tanh activation).
Finally got to the actual neural network model training! Now properly splitting the mini-batch data into input and target tensors, passing the input into the model to get the prediction, and minimizing the loss between prediction and target. Needed to do some ugly dtype casting to prevent `RuntimeError`s. Trying to keep this fairly basic without train/validation splits, and only ran this for 3 epochs. Have shifted some markdown blocks up where they belong too.
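A minimal sketch of such a training loop is shown below, including the dtype casting and the `.to` call that moves each mini-batch to the GPU when one is available. It runs on synthetic data, and the layer sizes, loss function, optimizer and learning rate are assumptions made for the example rather than the tutorial's actual choices; only the 6 inputs, 3 classes and 3 epochs follow the description in this PR.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for the ATL07 table: 6 input variables, 3 surface types
features = torch.rand(64, 6, dtype=torch.float64)  # float64, as pandas would give
targets = torch.randint(low=0, high=3, size=(64,))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Small multi-layer perceptron (hidden size of 16 is an arbitrary choice)
model = torch.nn.Sequential(
    torch.nn.Linear(6, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 3),
).to(device)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

dataloader = DataLoader(TensorDataset(features, targets), batch_size=16, shuffle=True)

losses = []
for epoch in range(3):
    for x, y in dataloader:
        # Cast to the dtypes the layers and loss expect, then move to the device
        x = x.to(device=device, dtype=torch.float32)
        y = y.to(device=device, dtype=torch.long)

        optimizer.zero_grad()
        prediction = model(x)  # shape: (batch_size, 3)
        loss = loss_fn(prediction, y)
        loss.backward()
        optimizer.step()
    losses.append(loss.item())  # loss of the last mini-batch in this epoch
```

Casting inside the loop avoids the `RuntimeError` raised when float64 inputs meet float32 weights; casting once when building the dataset would work equally well.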
Default CryoCloud docker image won't have PyTorch, so will need to install it at the first step.
Default CryoCloud image now has Geopandas 1.x, so can save to a non-beta version of GeoParquet schema now.
@weiji14 weiji14 marked this pull request as ready for review August 16, 2024 22:25
Member Author

weiji14 commented Aug 16, 2024

This should be ready for an initial round of reviews. If possible, I'd appreciate some help with the authentication issue at #17 (comment) (need to grab both an ATL07 file and also Sentinel-2), but I can also try to sort that out over the weekend.

There are a few other things I'd like to add to the notebook such as more explanation text at the start, and also show what the trained model's predicted ATL07 photon classifications look like, but that can be done in a follow-up PR.

Comment on lines 330 to 335
df_red = pygmt.grdtrack(
    grid=da_image.sel(band="red").compute(),  # Choose only the Red band
    points=gdf.get_coordinates(),  # x/y coordinates from ATL07
    newcolname="red_band_value",
    interpolation="n",  # nearest neighbour
)
Member

The first time I ran this on CryoCloud I got a traceback:

RuntimeError: Error opening 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/2/C/ND/2019/10/S2B_2CND_20191031_0_L2A/B04.tif': RasterioIOError('No driver registered.')

But re-running succeeded... might be an intermittent thing

Member Author

Yes, I have to re-run this sometimes too. Maybe there's a short-lived token or something.
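If the failure really is intermittent, one possible workaround is a small retry wrapper around the call, sketched below. The `retry` helper is hypothetical (it is not part of PyGMT or rasterio), and it assumes the error surfaces as a `RuntimeError`, as in the traceback above.

```python
import time


def retry(func, attempts=3, delay_seconds=1.0, exceptions=(RuntimeError,)):
    """Call func() until it succeeds, re-raising after `attempts` failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            time.sleep(delay_seconds)


# Stand-in for a call that fails on the first try and then succeeds
calls = {"count": 0}


def flaky_read():
    calls["count"] += 1
    if calls["count"] < 2:
        raise RuntimeError("No driver registered.")
    return "ok"


result = retry(flaky_read, attempts=3, delay_seconds=0)
```

In the notebook, the `pygmt.grdtrack` call could be wrapped in a lambda and passed as `func`. This only hides the symptom; it does not explain why the first read fails.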

Member

@scottyhq scottyhq left a comment

Thanks for this @weiji14! I won't have time for an in-depth review, but to me this looks like a really great tutorial! I wish we could do the direct S3 access in the CI workflow, but I think the easiest thing for now is just to commit a rendered notebook. Please go ahead and merge once you're happy with it.

Adding an overview diagram of the ATL07 + Sentinel-2 processing pipeline (illustrated using Excalidraw) to the start of the notebook. Made some minor edits to some of the markdown cells to include more references and explanatory text.
Pushing the photon_classifier Jupyter Notebook with pre-rendered cells that was run on CryoCloud. Putting the files under a 'machine-learning' folder, to be consistent with the other tutorials using subfolders.
Member Author

weiji14 commented Aug 18, 2024

Thanks Scott, I've added an overview diagram at the top, and have pushed the pre-rendered notebook (6.0MB) 🙈. Will merge this in now, and might work on some extra stuff at the last section (e.g. show the model training results section a bit better) in a follow-up PR if there's time.

@weiji14 weiji14 merged commit 2aa88b8 into main Aug 18, 2024
4 checks passed
@weiji14 weiji14 deleted the photon_classifier branch August 18, 2024 06:49
@tsutterley
Member

I know this PR is closed, but is this a photon classifier or a segment classifier? Photon classifiers label the individual photon events from ATL03 (e.g. YAPC or the land/veg classifier).

Member Author

weiji14 commented Aug 18, 2024

> I know this PR is closed, but is this a photon classifier or a segment classifier? Photon classifiers label the individual photon events from ATL03 (e.g. YAPC or the land/veg classifier).

Ah yes, I should probably have called this a point cloud classifier since ATL07 is based on an aggregate of ATL03 points. Let me fix that in a follow-up PR later.

Edit: Updates happening at #17
