Multigeometries #255

Open

wants to merge 4 commits into master
75 changes: 56 additions & 19 deletions src/rasterstats/main.py
@@ -46,7 +46,8 @@ def gen_zonal_stats(
        raster_out=False,
        prefix=None,
        geojson_out=False,
-       boundless=True, **kwargs):
+       boundless=True,
+       split_multi_geom=False, **kwargs):
    """Zonal statistics of raster values aggregated to vector geometries.

    Parameters
@@ -151,32 +152,68 @@ def gen_zonal_stats(
    features_iter = read_features(vectors, layer)
    for _, feat in enumerate(features_iter):
        geom = shape(feat['geometry'])

        if 'Point' in geom.type:
            geom = boxify_points(geom, rast)

-       geom_bounds = tuple(geom.bounds)
+       # Multiple polygons found and the split_multi_geom option is enabled.
+       # In this case, read the polygons one at a time and concatenate the
+       # raster values together, in order to save memory.
+       if ('MultiPolygon' in geom.type) and (split_multi_geom):
+           fsrc = None
+           rv_array = None
+           isnodata = None
+
+           # iterate through individual polygons
+           for single_polygon in geom.geoms:
+               polygon_bounds = tuple(single_polygon.bounds)
+               polygon_fsrc = rast.read(bounds=polygon_bounds)
+
+               # rasterized geometry
+               polygon_rv_array = rasterize_geom(single_polygon, like=polygon_fsrc, all_touched=all_touched)
+
+               # nodata mask
+               polygon_isnodata = (polygon_fsrc.array == polygon_fsrc.nodata)
+
+               # add nan mask (if necessary)
+               has_nan = (
+                   np.issubdtype(polygon_fsrc.array.dtype, np.floating)
+                   and np.isnan(polygon_fsrc.array.min()))
+               if has_nan:
+                   polygon_isnodata = (polygon_isnodata | np.isnan(polygon_fsrc.array))
+
+               if fsrc is None:
+                   fsrc = np.ravel(polygon_fsrc.array)
+                   rv_array = np.ravel(polygon_rv_array)
+                   isnodata = np.ravel(polygon_isnodata)
+               else:
+                   fsrc = np.append(fsrc, polygon_fsrc.array)
+                   rv_array = np.append(rv_array, polygon_rv_array)
+                   isnodata = np.append(isnodata, polygon_isnodata)
+
+           masked = np.ma.MaskedArray(fsrc, mask=(isnodata | ~rv_array))
Owner

I'm a bit confused as to how this saves any memory. My interpretation is that this code is an alternate method for creating the masked array, but in both cases it will be the same size regardless of the split_multi_geom value. In other words, where do the memory savings actually occur?

Author

This will be easier to explain with a diagram, but it seems I can't upload a picture here, so let me try with a description.

I'll use the example of a raster with 4 polygons that are each the size of 1 pixel, making this easier to imagine.

Let's say the raster is a 10 x 10 raster, with the 4 polygons falling on the pixels (0,0), (9,0), (0,9), and (9,9), so each polygon sits at a corner of the raster. Even though the polygons only cover 4 pixels in total, geom_bounds = tuple(geom.bounds) will return a bounding box spanning (0,0) through (9,9), because it has to find a rectangle that includes all 4 polygons. Then rast.read(bounds=geom_bounds, boundless=boundless) is going to read the entire 10x10 raster into a 10x10 array, and it will also create a 10x10 mask array. This ends up using 2x (size of raster) of memory.

If you break the multipolygon into 4 separate polygons, then each call to tuple(single_polygon.bounds) covers only 1 pixel. You then end up with 4x (1 pixel) of memory to hold the raster values for the polygons, and 4x (1 pixel) of memory to hold the mask.
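
A minimal sketch of that bounds behavior using shapely directly (the four unit squares standing in for pixel-sized polygons are invented purely for illustration):

    from shapely.geometry import MultiPolygon, box

    # four 1x1 "pixel" polygons at the corners of a notional 10x10 raster
    parts = [box(0, 0, 1, 1), box(9, 0, 10, 1), box(0, 9, 1, 10), box(9, 9, 10, 10)]
    multi = MultiPolygon(parts)

    print(multi.bounds)                     # (0.0, 0.0, 10.0, 10.0) -> the whole raster gets read
    print([p.bounds for p in multi.geoms])  # four 1x1 windows -> only 4 pixels get read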

Owner

I finally got a chance to look at this more closely and I see the optimization now, thanks for the explanation @yyu1!

I'd suggest adding a test to make sure the results are equal. This almost works; there are some floating-point differences that you'll have to account for, but otherwise they are equivalent:

def test_multipolygons_split():
    multipolygons = os.path.join(DATA, 'multipolygons.shp')
    stats = zonal_stats(multipolygons, raster, stats="*")
    stats2 = zonal_stats(multipolygons, raster, stats="*", split_multi_geom=True)
    assert stats == stats2
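
One way to absorb those floating-point differences (a sketch assuming the same test-module context as the snippet above, i.e. os, DATA, raster and zonal_stats are available, and that pytest is in use):

    import pytest

    def test_multipolygons_split():
        multipolygons = os.path.join(DATA, 'multipolygons.shp')
        stats = zonal_stats(multipolygons, raster, stats="*")
        stats2 = zonal_stats(multipolygons, raster, stats="*", split_multi_geom=True)
        for expected, actual in zip(stats, stats2):
            assert expected.keys() == actual.keys()
            for key, value in expected.items():
                if isinstance(value, float):
                    # tolerate tiny numerical drift from the per-polygon reads
                    assert actual[key] == pytest.approx(value)
                else:
                    assert actual[key] == value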

Our test multipolygon actually makes a great example due to the null space between the two parts.
The difference in memory is roughly 4x for this multipolygon! [screenshot from QGIS]

The only problem with this approach is that it changes the shape of the masked array. While this doesn't affect the built-in stats, there are two other parts of the API that could break here:

  • If the user provides a zone_func custom statistic, the logic could break in subtle ways - for instance if a custom stat used the array dimensions in some calculation.
  • If the user turns on the raster_out option (which outputs raster datasets based on masked), it would break, since that requires a 2D array with geospatial referencing.

So we have a few options for how to proceed. How about this?

  • If we get an add_stats argument and split_multi_geom == True, emit a warning that the custom stats function can't assume anything about the masked array dimensions. This way it still works, but the user is aware of the limitation.
  • If we get raster_out == True and split_multi_geom == True, we should raise an exception saying that these two options are mutually incompatible (a rough sketch of both guards follows below).

Of course a test to assert the above behavior would also be appreciated. I can help with that if needed.
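
A rough sketch of what those two guards might look like (the helper name and the exact messages are illustrative only, not part of this PR):

    import warnings

    def _check_split_multi_geom_options(split_multi_geom, raster_out, add_stats):
        # gen_zonal_stats() could call this near the top with its own arguments
        if split_multi_geom and raster_out:
            raise ValueError(
                "raster_out=True cannot be combined with split_multi_geom=True: "
                "splitting multi-geometries flattens the masked array, so no 2D "
                "georeferenced raster can be written out.")
        if split_multi_geom and add_stats:
            warnings.warn(
                "split_multi_geom=True flattens the masked array; custom add_stats "
                "functions must not rely on its shape or dimensions.")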

+       else:
+           geom_bounds = tuple(geom.bounds)

-       fsrc = rast.read(bounds=geom_bounds, boundless=boundless)
+           fsrc = rast.read(bounds=geom_bounds, boundless=boundless)

-       # rasterized geometry
-       rv_array = rasterize_geom(geom, like=fsrc, all_touched=all_touched)
+           # rasterized geometry
+           rv_array = rasterize_geom(geom, like=fsrc, all_touched=all_touched)

-       # nodata mask
-       isnodata = (fsrc.array == fsrc.nodata)
+           # nodata mask
+           isnodata = (fsrc.array == fsrc.nodata)

-       # add nan mask (if necessary)
-       has_nan = (
-           np.issubdtype(fsrc.array.dtype, np.floating)
-           and np.isnan(fsrc.array.min()))
-       if has_nan:
-           isnodata = (isnodata | np.isnan(fsrc.array))
+           # add nan mask (if necessary)
+           has_nan = (
+               np.issubdtype(fsrc.array.dtype, np.floating)
+               and np.isnan(fsrc.array.min()))
+           if has_nan:
+               isnodata = (isnodata | np.isnan(fsrc.array))

-       # Mask the source data array
-       # mask everything that is not a valid value or not within our geom
-       masked = np.ma.MaskedArray(
-           fsrc.array,
-           mask=(isnodata | ~rv_array))
+           # Mask the source data array
+           # mask everything that is not a valid value or not within our geom
+           masked = np.ma.MaskedArray(
+               fsrc.array,
+               mask=(isnodata | ~rv_array))

        # If we're on 64 bit platform and the array is an integer type
        # make sure we cast to 64 bit to avoid overflow.
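
For reference, a minimal usage sketch of the option this PR adds (the file names are placeholders):

    from rasterstats import zonal_stats

    # Each part of a MultiPolygon is read and rasterized separately, so the
    # full bounding box of the multi-geometry never has to be held in memory.
    stats = zonal_stats("multipolygons.shp", "elevation.tif",
                        stats="count min max mean",
                        split_multi_geom=True)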