Add documentation for writing Python UDF to work with raster data
Kontinuation committed Mar 22, 2024
1 parent 8f452d2 commit b7c4881
Showing 1 changed file with 48 additions and 0 deletions.
48 changes: 48 additions & 0 deletions docs/tutorial/raster.md
@@ -621,6 +621,54 @@ ds = raster.as_rasterio() # rasterio.DatasetReader object
band1 = ds.read(1) # read the first band
```

## Writing Python UDF to work with raster data

You can write Python UDFs that operate on raster data. A UDF takes `SedonaRaster` objects as input and can return
any Spark data type as output. The following example calculates the mean of all pixel values in the raster.

```python
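from pyspark.sql.functions import expr
from pyspark.sql.types import DoubleType

def mean_udf(raster):
    # Convert the NumPy scalar to a plain Python float to match DoubleType
    return float(raster.as_numpy().mean())

sedona.udf.register("mean_udf", mean_udf, DoubleType())
df_raster.withColumn("mean", expr("mean_udf(rast)")).show()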
```

```
+--------------------+------------------+
| rast| mean|
+--------------------+------------------+
|GridCoverage2D["g...|1542.8092886117788|
+--------------------+------------------+
```
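The same pattern extends to UDFs that return one value per band. The following is a minimal sketch of the NumPy part only, run on a synthetic array instead of a real raster. It assumes that `as_numpy()` returns an array whose first axis is the band index, which is how the mask example in this section indexes it. The helper name `band_means` and the sample data are hypothetical.

```python
import numpy as np

def band_means(pixels):
    """Return the mean of each band for an array shaped (bands, height, width)."""
    arr = np.asarray(pixels, dtype=np.float64)
    # Average over the two spatial axes, leaving one value per band
    return [float(m) for m in arr.mean(axis=(1, 2))]

# Synthetic stand-in for raster.as_numpy(): 2 bands, 2 rows, 3 columns
sample = np.array([
    [[1, 2, 3], [4, 5, 6]],
    [[10, 20, 30], [40, 50, 60]],
])
print(band_means(sample))  # [3.5, 35.0]
```

Inside a UDF, such a helper would presumably be called as `band_means(raster.as_numpy())` and registered with `ArrayType(DoubleType())` as the return type.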

Writing a UDF that returns a raster object is much trickier, since Sedona does not yet support serializing Python raster
objects. However, you can write a UDF that returns the band data as an array and then construct the raster object using
`RS_MakeRaster`. The following example creates a mask raster based on the first band of the input raster.

```python
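from pyspark.sql.functions import expr
from pyspark.sql.types import ArrayType, DoubleType
import numpy as np

def mask_udf(raster):
    # Select the first band (the first axis is the band index)
    band1 = raster.as_numpy()[0, :, :]
    # 1.0 where the pixel value is below 1400, 0.0 elsewhere
    mask = (band1 < 1400).astype(np.float64)
    # Return a flat list so Spark can store it as an array of doubles
    return mask.flatten().tolist()

sedona.udf.register("mask_udf", mask_udf, ArrayType(DoubleType()))
df_raster.withColumn("mask", expr("mask_udf(rast)")).withColumn("mask_rast", expr("RS_MakeRaster(rast, 'I', mask)")).show()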
```

```
+--------------------+--------------------+--------------------+
| rast| mask| mask_rast|
+--------------------+--------------------+--------------------+
|GridCoverage2D["g...|[0.0, 0.0, 0.0, 0...|GridCoverage2D["g...|
+--------------------+--------------------+--------------------+
```
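To make the flattening step in the mask UDF concrete, here is a small sketch that applies the same NumPy logic to a synthetic one-band array. The helper `mask_below` and the sample values are hypothetical. The sketch only shows that `flatten()` emits the values row by row and that reshaping with the band's height and width recovers the 2-D mask. It does not call Sedona, so it makes no claim about how `RS_MakeRaster` lays out the values.

```python
import numpy as np

def mask_below(pixels, threshold):
    """Flag first-band pixels below threshold as 1.0, others as 0.0, as a flat list."""
    band1 = np.asarray(pixels)[0, :, :]
    mask = (band1 < threshold).astype(np.float64)
    # flatten() walks the array row by row (C order)
    return mask.flatten().tolist()

# Synthetic stand-in for raster.as_numpy(): 1 band, 2 rows, 3 columns
sample = np.array([[[1300, 1500, 1399],
                    [1400, 1200, 1600]]])

flat = mask_below(sample, 1400)
print(flat)  # [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

# Reshaping with (height, width) restores the 2-D layout of the mask
restored = np.array(flat).reshape(sample.shape[1:])
print(restored.shape)  # (2, 3)
```

Note that the comparison is strict, so a pixel equal to the threshold (1400 here) is not included in the mask.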

## Performance optimization

When working with large raster datasets, refer to the [documentation on storing raster geometries in Parquet format](../storing-blobs-in-parquet) for recommendations to optimize performance.
