
Find active subglacial lake points using unsupervised clustering #149

Merged
weiji14 merged 12 commits into crossover_tracks from cluster_active_subglacial_lakes on Sep 13, 2020

Conversation


@weiji14 weiji14 commented Aug 18, 2020

Pick out active subglacial lakes in Antarctica from (pre-processed) ICESat-2 point clouds automatically using unsupervised clustering techniques. Utilize RAPIDS AI GPU-accelerated libraries to do so fast!

[Screenshot: IceSat2Explorer dashboard]

[Screenshot: Subglacial lake clusters at Whillans Ice Stream]

TODO:

  • Streamline Zarr ndarray to Parquet table conversion, to enable downstream work using RAPIDS AI libraries (8c2bf32, 2d1c8e7)
  • Use a fast GPU-based Point in Polygon algorithm to lump all points according to each Antarctic Drainage Basin (14b020c, 8efb862)
  • Improved visualizations, including an IceSat2Explorer dashboard (7945b3e, 5a8313a)
  • Use unsupervised clustering to find active subglacial lakes, those that drained, and those that filled (8efb862)
  • Build convex hull around point clusters, and save to shapefile/geojson format (8efb862)
  • etc

References:

Adding a new module to deepicedrain for Extract, Transform and Load (ETL) workflows! Putting slices of a 2D array into several columns inside a dataframe is now easier with the array_to_dataframe function. Inspired by dask/dask#5021. The function is generalized so that dask arrays convert to a dask DataFrame, and numpy arrays convert to a pandas DataFrame.
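
The dispatch idea is roughly the minimal sketch below; the `colname` template and the exact signature are illustrative, not the actual `array_to_dataframe` code:

```python
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd


def array_to_dataframe(array, colname: str = "col_{}"):
    """
    Sketch: spread the columns of a 2D array across dataframe columns.
    Dask arrays become dask DataFrames, numpy arrays become pandas ones.
    """
    columns = [colname.format(i) for i in range(array.shape[1])]
    if isinstance(array, da.Array):
        return dd.from_dask_array(x=array, columns=columns)
    return pd.DataFrame(data=array, columns=columns)


# A (10, 3) dask array becomes a lazy 3-column dask DataFrame,
# while the numpy equivalent becomes a pandas DataFrame.
dask_df = array_to_dataframe(array=da.ones(shape=(10, 3), chunks=(5, 3)))
pandas_df = array_to_dataframe(array=np.ones(shape=(10, 3)))
```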
@weiji14 weiji14 added the feature 🚀 Brand new feature label Aug 18, 2020
Make bounding box subsetting work on DataFrames too! This includes pandas, dask and cudf DataFrames. Included a parametrized test for pandas and dask; the cudf one should work too since the APIs are similar. The original xarray.DataArray subsetter code will still work.
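
The subsetting itself boils down to a boolean mask, something like the sketch below (the standalone function and column names are just for illustration):

```python
import pandas as pd


def subset_dataframe(df, xmin: float, xmax: float, ymin: float, ymax: float,
                     x_col: str = "x", y_col: str = "y"):
    """
    Keep only rows whose coordinates fall inside a bounding box. The same
    boolean-mask expression works on pandas, dask and cudf DataFrames
    because their indexing APIs are largely compatible.
    """
    mask = (
        (df[x_col] >= xmin) & (df[x_col] <= xmax)
        & (df[y_col] >= ymin) & (df[y_col] <= ymax)
    )
    return df.loc[mask]


df = pd.DataFrame(data={"x": [-10.0, 5.0, 20.0], "y": [0.0, 5.0, 30.0]})
print(subset_dataframe(df=df, xmin=0, xmax=10, ymin=0, ymax=10))  # keeps only the (5, 5) row
```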
CUDA-accelerated GIS and spatiotemporal algorithms! A repeat of 58fdcdf, but with a newer version at v0.15.0! Also patch 852a643 by bumping up cuml version and switching the scikit-learn order.
A very fast way to find points inside polygons! This is really just a convenience function that wraps around `cuspatial.point_in_polygon`, hiding all sorts of boilerplate. Specifically, this handles:

1. Converting a geopandas geodataframe into a cuspatial friendly format, see rapidsai/cuspatial#165
2. A hacky workaround for the 31-polygon limit using a for-loop, based on https://github.com/rapidsai/cuspatial/blob/branch-0.15/notebooks/nyc_taxi_years_correlation.ipynb
3. Outputting actual string labels from the geodataframe, instead of non-human-readable index numbers

Also added tests for this in test_spatiotemporal_gpu.py, though it won't work on the CI, only locally where a GPU is available.
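
For the record, the batching and label-mapping structure looks roughly like the sketch below. The cuspatial-specific offset/coordinate boilerplate is the fiddly part and is swapped out here for a CPU stand-in (`cpu_point_in_polygon_batch`), so all function and column names are illustrative rather than the actual deepicedrain API:

```python
import geopandas as gpd
import pandas as pd


def cpu_point_in_polygon_batch(points: pd.DataFrame, polygons: gpd.GeoDataFrame) -> pd.DataFrame:
    # CPU stand-in for the cuspatial.point_in_polygon boilerplate: returns
    # one boolean column per polygon, True where a point falls inside it.
    xy = gpd.GeoSeries(gpd.points_from_xy(x=points["x"], y=points["y"]))
    return pd.DataFrame(data={i: xy.within(geom) for i, geom in zip(polygons.index, polygons.geometry)})


def point_in_polygon_batched(points: pd.DataFrame, polygons: gpd.GeoDataFrame,
                             label_col: str, batch_size: int = 31) -> pd.Series:
    """
    Loop over the polygons in chunks of ~31 (the legacy cuspatial limit per
    call), and turn the per-polygon boolean columns into human readable labels.
    """
    labels = pd.Series(data="none", index=points.index, dtype="object")
    for start in range(0, len(polygons), batch_size):
        chunk = polygons.iloc[start : start + batch_size]
        booleans = cpu_point_in_polygon_batch(points=points, polygons=chunk)
        for column, label in zip(booleans.columns, chunk[label_col]):
            labels[booleans[column].to_numpy()] = label
    return labels
```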
Building on top of eb61ff6, but for n-dimensional arrays, and writing the dataframe to Parquet too! This function might be a little too convenient (read: contains hardcoding), but it smooths out some of the rough edges in terms of PyData file format interoperability. Should contribute this somewhere upstream when I get the time.
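
Conceptually it is the array_to_dataframe idea plus a reshape and a Parquet write, something like the sketch below (the column naming and the trailing-dimension assumption are mine, not necessarily what the actual function does):

```python
import dask.array as da
import dask.dataframe as dd


def ndarray_to_parquet_sketch(array: da.Array, parquetpath: str, colname: str = "col_{}") -> dd.DataFrame:
    """
    Sketch: collapse the leading dimensions of an n-dimensional dask array,
    spread the trailing dimension across columns, and write it out to Parquet.
    """
    flattened = array.reshape(-1, array.shape[-1])
    columns = [colname.format(i) for i in range(flattened.shape[-1])]
    dataframe = dd.from_dask_array(x=flattened, columns=columns)
    dataframe.to_parquet(parquetpath)  # needs pyarrow or fastparquet installed
    return dataframe


df = ndarray_to_parquet_sketch(array=da.ones(shape=(4, 5, 6), chunks=(2, 5, 6)),
                               parquetpath="example.parquet")
```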
@weiji14 weiji14 force-pushed the cluster_active_subglacial_lakes branch 2 times, most recently from deeebf4 to b65b8ba Compare August 22, 2020 00:00
Improve our HvPlot/Panel dashboard with some new bells and whistles! Like a proper GIS desktop tool, the xy_dhdt dashboard plot can now keep the zoom level when changing between variables (thanks to https://discourse.holoviz.org/t/keep-zoom-level-when-changing-between-variables-in-a-scatter-plot)! Supersedes e4874b0. This is a major refresh of my old IceSatExplorer code at https://github.com/weiji14/cryospheric-data-lakes/blob/master/code/scripts/h5_to_np_icesat.ipynb, which uses ICESat-1 instead of ICESat-2. The dashboard also takes a lot of cues from the example at https://examples.pyviz.org/datashader_dashboard/dashboard.html, implemented in holoviz/datashader#676.

Other significant improvements include a categorical colourmap for the 'referencegroundtrack' variable, and being able to see the height and time of an ICESat-2 measurement at a particular cycle on hover over the points! Oh, and did I mention that the rendering now happens on the GPU?!! Data transformed to and from Parquet is fast! Note that this is a work in progress, and that there are more sweeping improvements to come. I've also split out the crossover analysis code into a separate atlxi_lake.ipynb file since atlxi_dhdt.ipynb was getting too long.
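
The dashboard boils down to a variable selector widget plus an hvplot scatterplot method, roughly like the stripped down, CPU-only sketch below (class and column names are illustrative, and the zoom-keeping trick from the discourse thread isn't shown here):

```python
import hvplot.pandas  # noqa: registers the .hvplot accessor on pandas objects
import pandas as pd
import panel as pn
import param


class MiniDhdtExplorer(param.Parameterized):
    """Tiny IceSat2Explorer-style dashboard: a variable picker plus a scatter plot."""

    variable = param.ObjectSelector(
        default="dhdt_slope", objects=["dhdt_slope", "h_corr", "referencegroundtrack"]
    )

    def __init__(self, df: pd.DataFrame, **params):
        super().__init__(**params)
        self.df = df

    @param.depends("variable")
    def scatterplot(self):
        # rasterize=True (needs datashader) keeps big point clouds responsive
        return self.df.hvplot.scatter(x="x", y="y", c=self.variable, rasterize=True)


df = pd.DataFrame(
    data={"x": [0, 1, 2], "y": [0, 1, 2], "dhdt_slope": [-1.5, 0.0, 1.5],
          "h_corr": [100.0, 101.0, 102.0], "referencegroundtrack": [1, 2, 3]}
)
explorer = MiniDhdtExplorer(df=df)
dashboard = pn.Column(pn.Param(explorer.param), explorer.scatterplot)
# dashboard.show()  # serves the dashboard locally in a browser
```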
Parquet plugin for intake! Also edit Github Actions workflow to test on Pull Requests targeting any branch.
@weiji14 weiji14 force-pushed the cluster_active_subglacial_lakes branch from b65b8ba to 938ab84 Compare August 24, 2020 02:25
Improving the dashboard while making the code more maintainable by moving the pure hvplot scatterplot stuff into the intake atlas_catalog.yaml file, and placing the dashboard/widgets under vizplots.py. This is yet another attempt at tidying up the code in the jupyter notebook, moving it into the deepicedrain package instead! Also updated the alongtrack plot code to work with the new df_dhdt columnar data structure.

Will need to put the df_dhdt_{placename}.parquet data somewhere in the cloud (when I have time) so that the dashboard app can be used by more people, and also to enable unit testing of the visualization generators (always a tricky thing to test)! The dashboard is also currently hardcoded to plot the "whillans_upstream" area; will need to see if the placename can be used as an argument into the IceSat2Explorer class.
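
For reference, loading via the intake catalog should end up looking something like this (the catalog path, source name and predefined plot name below are placeholders, since the Parquet data isn't hosted online yet):

```python
import intake

# Placeholder names: the real catalog entry lives in atlas_catalog.yaml
# inside the deepicedrain repo and may be named differently.
catalog = intake.open_catalog(uri="atlas_catalog.yaml")
source = catalog["icesat2dhdt"]   # a Parquet source, read via the intake-parquet plugin
df = source.to_dask()             # lazy dask DataFrame of the df_dhdt columns
# Predefined hvplot plots declared in the catalog's metadata could then be
# called like: source.plot.dhdt_scatter()
```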

ghost commented Aug 24, 2020

Congratulations 🎉. DeepCode analyzed your code in 2.454 seconds and we found no issues. Enjoy a moment of no bugs ☀️.

👉 View analysis in DeepCode’s Dashboard | Configure the bot

@weiji14 weiji14 force-pushed the cluster_active_subglacial_lakes branch 2 times, most recently from e289cbb to 3869f07 Compare August 24, 2020 12:20
@weiji14 weiji14 force-pushed the cluster_active_subglacial_lakes branch from 3869f07 to 5a8313a Compare August 24, 2020 12:22
Fix Continuous Integration tests failing due to the IceSat2Explorer class not being able to load df_dhdt_whillans_upstream.parquet. Really need to put the file up somewhere, but until I find a good data repository (ideally with versioning), this hacky workaround will be a necessary evil.
Pinning the RAPIDS AI libraries from the alpha/development versions to the stable release version. Also generating an environment-linux-64.lock file for full reproducibility!

Bumps [cuml](https://github.com/rapidsai/cuml) from 0.15.0a200819 to 0.15.0.
- [Release notes](https://github.com/rapidsai/cuml/releases)
- [Changelog](https://github.com/rapidsai/cuml/blob/branch-0.15/CHANGELOG.md)
- [Commits](rapidsai/cuml@v0.15.0a...v0.15.0)

Bumps [cuspatial](https://github.com/rapidsai/cuspatial) from 0.15.0a200819 to 0.15.0.
- [Release notes](https://github.com/rapidsai/cuspatial/releases)
- [Changelog](https://github.com/rapidsai/cuspatial/blob/branch-0.15/CHANGELOG.md)
- [Commits](rapidsai/cuspatial@v0.15.0a...v0.15.0)
Detect active subglacial lakes in Antarctica using Density-Based Spatial Clustering of Applications with Noise (DBSCAN)! The subglacial lake detector works by finding clusters of high (filling at > 1 m/yr) or low (draining at < -1 m/yr) height change over time (dhdt) values, for each grounded drainage basin in Antarctica. CUDA GPUs are awesome: the point-in-polygon step takes 15 seconds and the lake clustering takes 12 seconds, working on >13 million points! Each cluster of points is then converted to a convex hull polygon, and we store some basic attribute information with the geometry such as the basin name, maximum absolute dhdt value, and reference ground tracks. The lakes are output to a geojson file using the EPSG:3031 projection.

This is a long overdue commit as the code has been working since mid-August, but I kept wanting to refactor it (still need to!). The DBSCAN clustering parameters (eps=2500 and min_samples=250) work ok for the Siple Coast and Slessor Glacier, but fail for Pine Island Glacier since there's a lot of downwasting. The algorithm definitely needs more work. The visualizations and crossover analysis code also need to be refreshed (since the schema has changed), but that's sitting locally on my computer, waiting to be tidied up a bit more.
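
Stripped of the GPU and basin bookkeeping, the lake finder works roughly like the sketch below, using scikit-learn's DBSCAN as a stand-in (cuml.cluster.DBSCAN takes the same eps/min_samples arguments); the column names are illustrative:

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import MultiPoint
from sklearn.cluster import DBSCAN  # cuml.cluster.DBSCAN is the GPU-accelerated equivalent


def find_lake_clusters(df: pd.DataFrame, eps: float = 2500, min_samples: int = 250) -> gpd.GeoDataFrame:
    """
    Cluster strongly draining (dhdt < -1 m/yr) and strongly filling
    (dhdt > +1 m/yr) points separately with DBSCAN on their x/y
    coordinates, then wrap each cluster in a convex hull polygon.
    """
    lakes = []
    for activity, condition in [("draining", df.dhdt_slope < -1), ("filling", df.dhdt_slope > 1)]:
        points = df.loc[condition]
        if len(points) < min_samples:
            continue
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[["x", "y"]])
        for label in np.unique(labels[labels != -1]):  # -1 marks noise points
            cluster = points.loc[labels == label]
            lakes.append({
                "activity": activity,
                "maxabsdhdt": cluster.dhdt_slope.abs().max(),
                "geometry": MultiPoint(points=cluster[["x", "y"]].to_numpy()).convex_hull,
            })
    return gpd.GeoDataFrame(data=lakes, geometry="geometry", crs="EPSG:3031")


# lakes_gdf = find_lake_clusters(df=df_dhdt)
# lakes_gdf.to_file(filename="antarctic_subglacial_lakes.geojson", driver="GeoJSON")
```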
Combining the draining/filling active lake cluster labels, which allows us to reduce the for-loop nesting in the active subglacial lake finder code, and to plot both draining and filling lakes in the same figure! Cluster labels are now negative integers for draining lakes, positive integers for filling lakes, and NaN for noise points. The lake cluster plot now uses a red (draining) and blue (filling) 'polar' colormap, with unclassified noise points in black as before. The code still takes 11 seconds to run for the entire Antarctic continent, which is awesome! Also made a minor change to the deepicedrain/__init__.py script to disable loading the IceSat2Explorer dashboard script, otherwise `import deepicedrain` would load stuff into GPU memory!
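
The sign convention is easy to reproduce with plain arrays; here is a toy demonstration (the +1 offset keeps cluster 0 from being ambiguous between draining and filling, and the per-point label arrays are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical per-point DBSCAN outputs from the two runs; -1 marks noise.
draining_labels = np.array([0, 1, -1, 0])
filling_labels = np.array([-1, -1, 0, -1])

# Signed convention: negative integers = draining clusters, positive
# integers = filling clusters, NaN = unclassified noise points.
combined = pd.Series(data=np.nan, index=range(len(draining_labels)))
combined[draining_labels != -1] = -(draining_labels[draining_labels != -1] + 1)
combined[filling_labels != -1] = filling_labels[filling_labels != -1] + 1
print(combined.tolist())  # [-1.0, -2.0, 1.0, -1.0]
```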
@weiji14 weiji14 force-pushed the cluster_active_subglacial_lakes branch from 35ca4a2 to 8982919 Compare September 13, 2020 11:40

sourcery-ai bot commented Sep 13, 2020

Sourcery Code Quality Report (beta)

✅  Merging this PR will increase code quality in the affected files by 0.02 out of 10.

| Quality metrics | Before | After | Change |
| --- | --- | --- | --- |
| Complexity | 0.86 | 0.82 | -0.04 🔵 |
| Method Length | 142.20 | 115.37 | -26.83 🔵 |
| Quality | 8.43 | 8.45 | 0.02 🔵 |

| Other metrics | Before | After | Change |
| --- | --- | --- | --- |
| Lines | 1470 | 1343 | -127 |
| Changed files | Quality Before | Quality After | Quality Change |
| --- | --- | --- | --- |
| atl11_play.py | 4.91 | 4.91 | 0.00 |
| atlxi_dhdt.py | 5.55 | 4.21 | -1.34 🔴 |
| deepicedrain/__init__.py | 8.75 | 8.66 | -0.09 🔴 |
| deepicedrain/spatiotemporal.py | 8.82 | 8.41 | -0.41 🔴 |
| deepicedrain/tests/test_deepicedrain.py | 9.69 | 9.68 | -0.01 🔴 |
| deepicedrain/tests/test_region.py | 9.04 | 8.87 | -0.17 🔴 |
| deepicedrain/tests/test_spatiotemporal_conversions.py | 8.58 | 8.58 | 0.00 |

Here are some functions in these files that still need a tune-up:

| File | Function | Complexity | Length | Overall | Recommendation |
| --- | --- | --- | --- | --- | --- |
| deepicedrain/spatiotemporal.py | point_in_polygon_gpu | 4 | 172.88 | 6.11 | Split out functionality |

Please see our documentation here for details on how these metrics are calculated.

We are actively working on this report - lots more documentation and extra metrics to come!

Let us know what you think of it by mentioning @sourcery-ai in a comment.

@weiji14 weiji14 marked this pull request as ready for review September 13, 2020 12:11
@weiji14 weiji14 merged commit 50757da into crossover_tracks Sep 13, 2020
@weiji14 weiji14 deleted the cluster_active_subglacial_lakes branch September 13, 2020 12:15
weiji14 added a commit that referenced this pull request Sep 15, 2020
Find active subglacial lake points using unsupervised clustering (#149)

Closes #149 Find active subglacial lake points using unsupervised clustering.
@weiji14 weiji14 added this to the v0.3.0 milestone Sep 16, 2020