From b7c488121e8c248a0be014daf1882aa7954fc659 Mon Sep 17 00:00:00 2001
From: Kristin Cowalcijk
Date: Fri, 22 Mar 2024 08:10:20 +0800
Subject: [PATCH] Add documentation for writing Python UDF to work with raster data

---
 docs/tutorial/raster.md | 48 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/docs/tutorial/raster.md b/docs/tutorial/raster.md
index b95141b6d4..cfc64bc9c6 100644
--- a/docs/tutorial/raster.md
+++ b/docs/tutorial/raster.md
@@ -621,6 +621,54 @@ ds = raster.as_rasterio() # rasterio.DatasetReader object
 band1 = ds.read(1) # read the first band
 ```
 
+## Writing Python UDF to work with raster data
+
+You can write Python UDFs to work with raster data. The UDFs can take `SedonaRaster` objects as input and
+return any Spark data type as output. This is an example of a Python UDF that calculates the mean of the raster data.
+
+```python
+from pyspark.sql.types import DoubleType
+
+def mean_udf(raster):
+    return float(raster.as_numpy().mean())
+
+sedona.udf.register("mean_udf", mean_udf, DoubleType())
+df_raster.withColumn("mean", expr("mean_udf(rast)")).show()
+```
+
+```
++--------------------+------------------+
+|                rast|              mean|
++--------------------+------------------+
+|GridCoverage2D["g...|1542.8092886117788|
++--------------------+------------------+
+```
+
+It is much trickier to write a UDF that returns a raster object, since Sedona does not support serializing Python raster
+objects yet. However, you can write a UDF that returns the band data as an array and then construct the raster object using
+`RS_MakeRaster`. This is an example of a Python UDF that creates a mask raster based on the first band of the input raster.
+
+```python
+from pyspark.sql.types import ArrayType, DoubleType
+import numpy as np
+
+def mask_udf(raster):
+    band1 = raster.as_numpy()[0,:,:]
+    mask = (band1 < 1400).astype(np.float64)
+    return mask.flatten().tolist()
+
+sedona.udf.register("mask_udf", mask_udf, ArrayType(DoubleType()))
+df_raster.withColumn("mask", expr("mask_udf(rast)")).withColumn("mask_rast", expr("RS_MakeRaster(rast, 'I', mask)")).show()
+```
+
+```
++--------------------+--------------------+--------------------+
+|                rast|                mask|           mask_rast|
++--------------------+--------------------+--------------------+
+|GridCoverage2D["g...|[0.0, 0.0, 0.0, 0...|GridCoverage2D["g...|
++--------------------+--------------------+--------------------+
+```
+
 ## Performance optimization
 
 When working with large raster datasets, refer to the [documentation on storing raster geometries in Parquet format](../storing-blobs-in-parquet) for recommendations to optimize performance.
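
Note on the mask example in this patch: the UDF thresholds the first band and returns it as a flat list of floats. The sketch below reproduces just that step with plain NumPy so it can be checked without a Spark session. The 2x3 array and the helper name `mask_band` are hypothetical stand-ins for `raster.as_numpy()[0, :, :]` and the UDF body; whether `RS_MakeRaster` expects exactly this row-major layout is an assumption not confirmed by the patch.

```python
import numpy as np

# Hypothetical stand-in for raster.as_numpy()[0, :, :] (one band, 2 rows x 3 columns).
band1 = np.array([[1300.0, 1500.0, 1399.0],
                  [1400.0, 1200.0, 1600.0]])

def mask_band(band, threshold=1400):
    """Return 1.0 where the band value is below the threshold, else 0.0."""
    return (band < threshold).astype(np.float64)

mask = mask_band(band1)

# flatten() walks the array in row-major order; tolist() yields plain Python floats,
# which is what a UDF declared as ArrayType(DoubleType()) needs to return.
flat = mask.flatten().tolist()

# Reshaping the flat list with the band's shape recovers the 2-D mask.
restored = np.array(flat).reshape(band1.shape)
```

The comparison is strict, so a pixel equal to the threshold (1400 here) is masked out as 0.0.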