
Find active subglacial lake points using unsupervised clustering #149

Merged
weiji14 merged 12 commits into crossover_tracks from cluster_active_subglacial_lakes on Sep 13, 2020

Conversation


@weiji14 weiji14 commented Aug 18, 2020

Pick out active subglacial lakes in Antarctica from (pre-processed) ICESat-2 point clouds automatically using unsupervised clustering techniques. Utilize RAPIDS AI GPU-accelerated libraries to do so fast!

[Screenshot: IceSat2Explorer dashboard]

[Screenshot: Subglacial lake clusters at Whillans Ice Stream]

TODO:

  • Streamline Zarr ndarray to Parquet table conversion, to enable downstream work using RAPIDS AI libraries (8c2bf32, 2d1c8e7)
  • Use a fast GPU-based Point in Polygon algorithm to lump all points according to each Antarctic Drainage Basin (14b020c, 8efb862)
  • Improved visualizations, including an IceSat2Explorer dashboard (7945b3e, 5a8313a)
  • Use unsupervised clustering to find active subglacial lakes, those that drained, and those that filled (8efb862)
  • Build convex hull around point clusters, and save to shapefile/geojson format (8efb862)
  • etc

References:

Adding a new module to deepicedrain for Extract, Transform and Load (ETL) workflows! Putting slices of a 2D array into several columns inside a dataframe is now easier with the array_to_dataframe function. Inspired by dask/dask#5021. The function is generalized so that dask arrays convert to a dask DataFrame, and numpy arrays convert to a pandas DataFrame.
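
The dispatch idea is roughly the minimal sketch below; the `colname` template and the exact signature are illustrative, not the actual `array_to_dataframe` code:

```python
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd


def array_to_dataframe(array, colname: str = "col_{}"):
    """
    Sketch: spread the columns of a 2D array across dataframe columns.
    Dask arrays become dask DataFrames, numpy arrays become pandas ones.
    """
    columns = [colname.format(i) for i in range(array.shape[1])]
    if isinstance(array, da.Array):
        return dd.from_dask_array(x=array, columns=columns)
    return pd.DataFrame(data=array, columns=columns)


# A (10, 3) dask array becomes a lazy 3-column dask DataFrame,
# while the numpy equivalent becomes a pandas DataFrame.
dask_df = array_to_dataframe(array=da.ones(shape=(10, 3), chunks=(5, 3)))
pandas_df = array_to_dataframe(array=np.ones(shape=(10, 3)))
```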
@weiji14 weiji14 added the feature 🚀 Brand new feature label Aug 18, 2020
Make bounding box subsetting work on DataFrames too! This includes pandas, dask and cudf DataFrames. Included a parametrized test for pandas and dask; the cudf one should work too since the APIs are similar. The original xarray.DataArray subsetter code will still work.
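
The subsetting itself boils down to a boolean mask, something like the sketch below (the standalone function and column names are just for illustration):

```python
import pandas as pd


def subset_dataframe(df, xmin: float, xmax: float, ymin: float, ymax: float,
                     x_col: str = "x", y_col: str = "y"):
    """
    Keep only rows whose coordinates fall inside a bounding box. The same
    boolean-mask expression works on pandas, dask and cudf DataFrames
    because their indexing APIs are largely compatible.
    """
    mask = (
        (df[x_col] >= xmin) & (df[x_col] <= xmax)
        & (df[y_col] >= ymin) & (df[y_col] <= ymax)
    )
    return df.loc[mask]


df = pd.DataFrame(data={"x": [-10.0, 5.0, 20.0], "y": [0.0, 5.0, 30.0]})
print(subset_dataframe(df=df, xmin=0, xmax=10, ymin=0, ymax=10))  # keeps only the (5, 5) row
```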
CUDA-accelerated GIS and spatiotemporal algorithms! A repeat of 58fdcdf, but with a newer version at v0.15.0! Also patch 852a643 by bumping up cuml version and switching the scikit-learn order.
A very fast way to find points inside polygons! This is really just a convenience function that wraps around `cuspatial.point_in_polygon`, hiding all sorts of boilerplate. Specifically, this handles:

1. Converting a geopandas geodataframe into a cuspatial friendly format, see rapidsai/cuspatial#165
2. A hacky workaround for the 31-polygon limit using a for-loop, based on https://github.com/rapidsai/cuspatial/blob/branch-0.15/notebooks/nyc_taxi_years_correlation.ipynb
3. Outputting actual string labels from the geodataframe, instead of non-human-readable index numbers

Also added tests for this in test_spatiotemporal_gpu.py, though it won't work on the CI, only locally where a GPU is available.
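
For the record, the batching and label-mapping structure looks roughly like the sketch below. The cuspatial-specific offset/coordinate boilerplate is the fiddly part and is swapped out here for a CPU stand-in (`cpu_point_in_polygon_batch`), so all function and column names are illustrative rather than the actual deepicedrain API:

```python
import geopandas as gpd
import pandas as pd


def cpu_point_in_polygon_batch(points: pd.DataFrame, polygons: gpd.GeoDataFrame) -> pd.DataFrame:
    # CPU stand-in for the cuspatial.point_in_polygon boilerplate: returns
    # one boolean column per polygon, True where a point falls inside it.
    xy = gpd.GeoSeries(gpd.points_from_xy(x=points["x"], y=points["y"]))
    return pd.DataFrame(data={i: xy.within(geom) for i, geom in zip(polygons.index, polygons.geometry)})


def point_in_polygon_batched(points: pd.DataFrame, polygons: gpd.GeoDataFrame,
                             label_col: str, batch_size: int = 31) -> pd.Series:
    """
    Loop over the polygons in chunks of ~31 (the legacy cuspatial limit per
    call), and turn the per-polygon boolean columns into human readable labels.
    """
    labels = pd.Series(data="none", index=points.index, dtype="object")
    for start in range(0, len(polygons), batch_size):
        chunk = polygons.iloc[start : start + batch_size]
        booleans = cpu_point_in_polygon_batch(points=points, polygons=chunk)
        for column, label in zip(booleans.columns, chunk[label_col]):
            labels[booleans[column].to_numpy()] = label
    return labels
```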
Building on top of eb61ff6, but for n-dimensional arrays, and writing the dataframe to Parquet too! This function might be a little too convenient (read: contains hardcoding), but it smooths out some of the rough edges in terms of PyData file format interoperability. Should contribute this somewhere upstream when I get the time.
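
Conceptually it is the array_to_dataframe idea plus a reshape and a Parquet write, something like the sketch below (the column naming and the trailing-dimension assumption are mine, not necessarily what the actual function does):

```python
import dask.array as da
import dask.dataframe as dd


def ndarray_to_parquet_sketch(array: da.Array, parquetpath: str, colname: str = "col_{}") -> dd.DataFrame:
    """
    Sketch: collapse the leading dimensions of an n-dimensional dask array,
    spread the trailing dimension across columns, and write it out to Parquet.
    """
    flattened = array.reshape(-1, array.shape[-1])
    columns = [colname.format(i) for i in range(flattened.shape[-1])]
    dataframe = dd.from_dask_array(x=flattened, columns=columns)
    dataframe.to_parquet(parquetpath)  # needs pyarrow or fastparquet installed
    return dataframe


df = ndarray_to_parquet_sketch(array=da.ones(shape=(4, 5, 6), chunks=(2, 5, 6)),
                               parquetpath="example.parquet")
```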
@weiji14 weiji14 force-pushed the cluster_active_subglacial_lakes branch 2 times, most recently from deeebf4 to b65b8ba Compare August 22, 2020 00:00
Improve our HvPlot/Panel dashboard with some new bells and whistles! Like a proper GIS desktop tool, the xy_dhdt dashboard plot can now keep the zoom level when changing between variables (thanks to https://discourse.holoviz.org/t/keep-zoom-level-when-changing-between-variables-in-a-scatter-plot)! Supersedes e4874b0. This is a major refresh of my old IceSatExplorer code at https://github.com/weiji14/cryospheric-data-lakes/blob/master/code/scripts/h5_to_np_icesat.ipynb, which uses ICESat-1 instead of ICESat-2. The dashboard also takes a lot of cues from the example at https://examples.pyviz.org/datashader_dashboard/dashboard.html, implemented in holoviz/datashader#676.

Other significant improvements include a categorical colourmap for the 'referencegroundtrack' variable, and being able to see the height and time of an ICESat-2 measurement at a particular cycle on hover over the points! Oh, and did I mention that the rendering now happens on the GPU?!! Data transformed to and from Parquet is fast! Note that this is a work in progress, and that there are more sweeping improvements to come. I've also split out the crossover analysis code into a separate atlxi_lake.ipynb file since atlxi_dhdt.ipynb was getting too long.
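
The dashboard boils down to a variable selector widget plus an hvplot scatterplot method, roughly like the stripped down, CPU-only sketch below (class and column names are illustrative, and the zoom-keeping trick from the discourse thread isn't shown here):

```python
import hvplot.pandas  # noqa: registers the .hvplot accessor on pandas objects
import pandas as pd
import panel as pn
import param


class MiniDhdtExplorer(param.Parameterized):
    """Tiny IceSat2Explorer-style dashboard: a variable picker plus a scatter plot."""

    variable = param.ObjectSelector(
        default="dhdt_slope", objects=["dhdt_slope", "h_corr", "referencegroundtrack"]
    )

    def __init__(self, df: pd.DataFrame, **params):
        super().__init__(**params)
        self.df = df

    @param.depends("variable")
    def scatterplot(self):
        # rasterize=True (needs datashader) keeps big point clouds responsive
        return self.df.hvplot.scatter(x="x", y="y", c=self.variable, rasterize=True)


df = pd.DataFrame(
    data={"x": [0, 1, 2], "y": [0, 1, 2], "dhdt_slope": [-1.5, 0.0, 1.5],
          "h_corr": [100.0, 101.0, 102.0], "referencegroundtrack": [1, 2, 3]}
)
explorer = MiniDhdtExplorer(df=df)
dashboard = pn.Column(pn.Param(explorer.param), explorer.scatterplot)
# dashboard.show()  # serves the dashboard locally in a browser
```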
Parquet plugin for intake! Also edit Github Actions workflow to test on Pull Requests targeting any branch.
@weiji14 weiji14 force-pushed the cluster_active_subglacial_lakes branch from b65b8ba to 938ab84 Compare August 24, 2020 02:25
Improving the dashboard while making the code more maintainable by moving the pure hvplot scatterplot stuff into the intake atlas_catalog.yaml file, and placing the dashboard/widgets under vizplots.py. This is yet another attempt at tidying up the code in the jupyter notebook, moving it into the deepicedrain package instead! Also updated the alongtrack plot code to work with the new df_dhdt columnar data structure.

Will need to put the df_dhdt_{placename}.parquet data somewhere in the cloud (when I have time) so that the dashboard app can be used by more people, and also to enable unit testing of the visualization generators (always a tricky thing to test)! The dashboard is also currently hardcoded to plot the "whillans_upstream" area; will need to see if the placename can be used as an argument into the IceSat2Explorer class.
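
For reference, loading via the intake catalog should end up looking something like this (the catalog path, source name and predefined plot name below are placeholders, since the Parquet data isn't hosted online yet):

```python
import intake

# Placeholder names: the real catalog entry lives in atlas_catalog.yaml
# inside the deepicedrain repo and may be named differently.
catalog = intake.open_catalog(uri="atlas_catalog.yaml")
source = catalog["icesat2dhdt"]   # a Parquet source, read via the intake-parquet plugin
df = source.to_dask()             # lazy dask DataFrame of the df_dhdt columns
# Predefined hvplot plots declared in the catalog's metadata could then be
# called like: source.plot.dhdt_scatter()
```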

ghost commented Aug 24, 2020

Congratulations 🎉. DeepCode analyzed your code in 2.454 seconds and we found no issues. Enjoy a moment of no bugs ☀️.

👉 View analysis in DeepCode’s Dashboard | Configure the bot

@weiji14 weiji14 force-pushed the cluster_active_subglacial_lakes branch 2 times, most recently from e289cbb to 3869f07 Compare August 24, 2020 12:20
@weiji14 weiji14 force-pushed the cluster_active_subglacial_lakes branch from 3869f07 to 5a8313a Compare August 24, 2020 12:22
Fix Continuous Integration tests failing due to the IceSat2Explorer class not being able to load df_dhdt_whillans_upstream.parquet. Really need to put the file up somewhere, but until I find a good data repository (ideally with versioning), this hacky workaround will be a necessary evil.
Pinning the RAPIDS AI libraries from the alpha/development versions to the stable release version. Also generating an environment-linux-64.lock file for full reproducibility!

Bumps [cuml](https://github.com/rapidsai/cuml) from 0.15.0a200819 to 0.15.0.
- [Release notes](https://github.com/rapidsai/cuml/releases)
- [Changelog](https://github.com/rapidsai/cuml/blob/branch-0.15/CHANGELOG.md)
- [Commits](rapidsai/cuml@v0.15.0a...v0.15.0)

Bumps [cuspatial](https://github.com/rapidsai/cuspatial) from 0.15.0a200819 to 0.15.0.
- [Release notes](https://github.com/rapidsai/cuspatial/releases)
- [Changelog](https://github.com/rapidsai/cuspatial/blob/branch-0.15/CHANGELOG.md)
- [Commits](rapidsai/cuspatial@v0.15.0a...v0.15.0)
Detect active subglacial lakes in Antarctica using Density-Based Spatial Clustering of Applications with Noise (DBSCAN)! The subglacial lake detector works by finding clusters of high (filling at > 1 m/yr) or low (draining at < -1 m/yr) height change over time (dhdt) values, for each grounded drainage basin in Antarctica. CUDA GPUs are awesome: the point-in-polygon step takes 15 seconds and the lake clustering takes 12 seconds, working on >13 million points! Each cluster of points is then converted to a convex hull polygon, and we store some basic attribute information with the geometry such as the basin name, maximum absolute dhdt value, and reference ground tracks. The lakes are output to a geojson file using the EPSG:3031 projection.

This is a long overdue commit as the code has been working since mid-August, but I kept wanting to refactor it (still need to!). The DBSCAN clustering parameters (eps=2500 and min_samples=250) work ok for the Siple Coast and Slessor Glacier, but fail for Pine Island Glacier since there's a lot of downwasting. The algorithm definitely needs more work. The visualizations and crossover analysis code also need to be refreshed (since the schema has changed), but that's sitting locally on my computer, waiting to be tidied up a bit more.
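
Stripped of the GPU and basin bookkeeping, the lake finder works roughly like the sketch below, using scikit-learn's DBSCAN as a stand-in (cuml.cluster.DBSCAN takes the same eps/min_samples arguments); the column names are illustrative:

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import MultiPoint
from sklearn.cluster import DBSCAN  # cuml.cluster.DBSCAN is the GPU-accelerated equivalent


def find_lake_clusters(df: pd.DataFrame, eps: float = 2500, min_samples: int = 250) -> gpd.GeoDataFrame:
    """
    Cluster strongly draining (dhdt < -1 m/yr) and strongly filling
    (dhdt > +1 m/yr) points separately with DBSCAN on their x/y
    coordinates, then wrap each cluster in a convex hull polygon.
    """
    lakes = []
    for activity, condition in [("draining", df.dhdt_slope < -1), ("filling", df.dhdt_slope > 1)]:
        points = df.loc[condition]
        if len(points) < min_samples:
            continue
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[["x", "y"]])
        for label in np.unique(labels[labels != -1]):  # -1 marks noise points
            cluster = points.loc[labels == label]
            lakes.append({
                "activity": activity,
                "maxabsdhdt": cluster.dhdt_slope.abs().max(),
                "geometry": MultiPoint(points=cluster[["x", "y"]].to_numpy()).convex_hull,
            })
    return gpd.GeoDataFrame(data=lakes, geometry="geometry", crs="EPSG:3031")


# lakes_gdf = find_lake_clusters(df=df_dhdt)
# lakes_gdf.to_file(filename="antarctic_subglacial_lakes.geojson", driver="GeoJSON")
```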
Combining the draining/filling active lake cluster labels, which allows us to reduce the for-loop nesting in the active subglacial lake finder code, and to plot both draining and filling lakes in the same figure! Cluster labels are now negative integers for draining lakes, positive integers for filling lakes, and NaN for noise points. The lake cluster plot now uses a red (draining) and blue (filling) 'polar' colormap, with unclassified noise points in black as before. The code still takes 11 seconds to run for the entire Antarctic continent, which is awesome! Also made a minor change to the deepicedrain/__init__.py script to disable loading the IceSat2Explorer dashboard script, otherwise `import deepicedrain` would load stuff into GPU memory!
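
The sign convention is easy to reproduce with plain arrays; here is a toy demonstration (the +1 offset keeps cluster 0 from being ambiguous between draining and filling, and the per-point label arrays are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical per-point DBSCAN outputs from the two runs; -1 marks noise.
draining_labels = np.array([0, 1, -1, 0])
filling_labels = np.array([-1, -1, 0, -1])

# Signed convention: negative integers = draining clusters, positive
# integers = filling clusters, NaN = unclassified noise points.
combined = pd.Series(data=np.nan, index=range(len(draining_labels)))
combined[draining_labels != -1] = -(draining_labels[draining_labels != -1] + 1)
combined[filling_labels != -1] = filling_labels[filling_labels != -1] + 1
print(combined.tolist())  # [-1.0, -2.0, 1.0, -1.0]
```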
@weiji14 weiji14 force-pushed the cluster_active_subglacial_lakes branch from 35ca4a2 to 8982919 Compare September 13, 2020 11:40

sourcery-ai bot commented Sep 13, 2020

Sourcery Code Quality Report (beta)

✅  Merging this PR will increase code quality in the affected files by 0.02 out of 10.

| Quality metrics | Before | After | Change |
| --- | --- | --- | --- |
| Complexity | 0.86 | 0.82 | -0.04 🔵 |
| Method Length | 142.20 | 115.37 | -26.83 🔵 |
| Quality | 8.43 | 8.45 | 0.02 🔵 |

| Other metrics | Before | After | Change |
| --- | --- | --- | --- |
| Lines | 1470 | 1343 | -127 |
| Changed files | Quality Before | Quality After | Quality Change |
| --- | --- | --- | --- |
| atl11_play.py | 4.91 | 4.91 | 0.00 |
| atlxi_dhdt.py | 5.55 | 4.21 | -1.34 🔴 |
| deepicedrain/__init__.py | 8.75 | 8.66 | -0.09 🔴 |
| deepicedrain/spatiotemporal.py | 8.82 | 8.41 | -0.41 🔴 |
| deepicedrain/tests/test_deepicedrain.py | 9.69 | 9.68 | -0.01 🔴 |
| deepicedrain/tests/test_region.py | 9.04 | 8.87 | -0.17 🔴 |
| deepicedrain/tests/test_spatiotemporal_conversions.py | 8.58 | 8.58 | 0.00 |

Here are some functions in these files that still need a tune-up:

| File | Function | Complexity | Length | Overall | Recommendation |
| --- | --- | --- | --- | --- | --- |
| deepicedrain/spatiotemporal.py | point_in_polygon_gpu | 4 | 172.88 | 6.11 | Split out functionality |

Please see our documentation here for details on how these metrics are calculated.

We are actively working on this report - lots more documentation and extra metrics to come!

Let us know what you think of it by mentioning @sourcery-ai in a comment.

@weiji14 weiji14 marked this pull request as ready for review September 13, 2020 12:11
@weiji14 weiji14 merged commit 50757da into crossover_tracks Sep 13, 2020
@weiji14 weiji14 deleted the cluster_active_subglacial_lakes branch September 13, 2020 12:15
weiji14 added a commit that referenced this pull request Sep 15, 2020
Find active subglacial lake points using unsupervised clustering (#149)

Closes #149 Find active subglacial lake points using unsupervised clustering.
@weiji14 weiji14 added this to the v0.3.0 milestone Sep 16, 2020