
Multigeometries #255

Open · wants to merge 4 commits into master
Conversation

@yyu1 yyu1 commented Feb 10, 2022

Here is the code to enable the "split_multi_geom" option for zonal_stats, as we discussed. It reads the individual polygons within a multipolygon geometry one at a time to reduce memory usage.

I only did a bare minimum of testing against the slope.tif and multipolygon.shp files in the test directory. I don't have the data or the time to do more extensive testing, so perhaps some of the users asking for this feature can run their own tests.

sgoodm (Contributor) commented Feb 10, 2022

In case it is useful for future users looking to address memory issues, I submitted a PR a while back that takes this concept one step further: it can split polygons into smaller sub-polygons to reduce memory usage, then aggregates the stats back onto the original polygons at the end.

#154

My fork with this feature is a bit outdated and includes some other features required for my application, but if needed it is available here: https://github.com/aiddata/python-rasterstats

perrygeo (Owner) left a comment


Thanks @yyu1

I'd like to see some tests for this functionality before merging. Specifically, we need a test that exercises the new code branch and asserts that the results for a multipolygon are equal regardless of the split_multi_geom setting.

Some additional comments below

src/rasterstats/main.py:
fsrc = np.append(fsrc, polygon_fsrc.array)
rv_array = np.append(rv_array, polygon_rv_array)
isnodata = np.append(isnodata, polygon_isnodata)
masked = np.ma.MaskedArray(fsrc, mask=(isnodata | ~rv_array))
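For context, the snippet above accumulates per-polygon pieces into flat 1-D arrays and builds the combined mask at the end. A minimal, self-contained NumPy sketch of that pattern (the array values here are illustrative, not from the test data):

```python
import numpy as np

# Flat accumulators, as in the PR's per-polygon loop.
fsrc = np.array([], dtype=float)
rv_array = np.array([], dtype=bool)
isnodata = np.array([], dtype=bool)

# Two hypothetical per-polygon pieces: (values, inside-polygon flags, nodata flags).
pieces = [
    (np.array([1.0, 2.0]), np.array([True, True]), np.array([False, False])),
    (np.array([3.0, 4.0]), np.array([True, False]), np.array([False, True])),
]
for p_fsrc, p_rv, p_nodata in pieces:
    fsrc = np.append(fsrc, p_fsrc)
    rv_array = np.append(rv_array, p_rv)
    isnodata = np.append(isnodata, p_nodata)

# Mask out pixels that are nodata or fall outside the rasterized polygons.
masked = np.ma.MaskedArray(fsrc, mask=(isnodata | ~rv_array))
assert masked.compressed().tolist() == [1.0, 2.0, 3.0]
```

Note the result is a 1-D array: stats like mean or count are unaffected, but the original 2-D window shape is lost, which is relevant to the API concerns raised later in this thread.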
perrygeo (Owner):

I'm a bit confused as to how this saves any memory. My interpretation is this code is an alternate method for creating the masked array, but in both cases they will be same size regardless of the split_multi_geom value. In other words, where do the memory savings actually occur?

yyu1 (Author):

This will be easier to explain with a diagram, but it seems I can't upload a picture here, so let me try with a description.

I'll use the example of a raster with 4 polygons that are each the size of 1 pixel, to make this easier to imagine.

Let's say the raster is a 10 x 10 raster, with the 4 polygons falling on the pixels (0,0), (9,0), (0,9), and (9,9), so each polygon sits at a corner of the raster. Even though the polygons only cover 4 pixels in total, geom_bounds = tuple(geom.bounds) will return the bounds of the whole multipolygon, i.e. the rectangle spanning (0,0) to (9,9), because bounds is the smallest rectangle that contains all 4 polygons. Then rast.read(bounds=geom_bounds, boundless=boundless) is going to read the entire 10x10 raster into a 10x10 array, and it will also create a 10x10 mask array. This ends up using 2x (the size of the raster) of memory.

If you break the multipolygon into 4 separate polygons, then each call to geom_bounds = tuple(geom.bounds) returns a window of just 1 pixel. You then only need 4x (1 pixel) of memory to hold the raster data for the polygons, and another 4x (1 pixel) to hold the mask.
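The bounds arithmetic above can be sketched in plain Python (the helper names bounds and window_pixels are hypothetical, just for illustration; rasterstats itself gets bounds from shapely geometries):

```python
# A 10x10 raster with four 1-pixel polygons at its corners,
# each polygon given as a list of (col, row) pixel coordinates.
def bounds(coords):
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))

def window_pixels(b):
    # Pixels a boundless read of this bounding window would load.
    xmin, ymin, xmax, ymax = b
    return (xmax - xmin + 1) * (ymax - ymin + 1)

polygons = [[(0, 0)], [(9, 0)], [(0, 9)], [(9, 9)]]  # one pixel each

# Reading the multipolygon as a whole: bounds span the full raster.
combined = bounds([pt for poly in polygons for pt in poly])
assert window_pixels(combined) == 100  # entire 10x10 window in memory

# Splitting first: four separate 1-pixel windows.
split_total = sum(window_pixels(bounds(poly)) for poly in polygons)
assert split_total == 4
```

The 100-vs-4 pixel difference (doubled, since a mask array of the same shape is allocated too) is the memory saving the split option targets.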

perrygeo (Owner):

I finally got a chance to look at this more closely and I see the optimization now, thanks for the explanation @yyu1!

I'd suggest adding a test to make sure the results are equal. This almost works; there are some floating-point differences that you'll have to account for, but otherwise they are equivalent:

def test_multipolygons_split():
    multipolygons = os.path.join(DATA, 'multipolygons.shp')
    stats = zonal_stats(multipolygons, raster, stats="*")
    stats2 = zonal_stats(multipolygons, raster, stats="*", split_multi_geom=True)
    assert stats == stats2
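One way to account for the floating-point differences would be a tolerant comparison helper along these lines (a sketch, not part of rasterstats; the stats_close name and tolerance are assumptions):

```python
import math

def stats_close(a, b, rel_tol=1e-9):
    # Compare two lists of per-feature stats dicts, tolerating
    # floating-point drift in numeric values.
    if len(a) != len(b):
        return False
    for da, db in zip(a, b):
        if da.keys() != db.keys():
            return False
        for k in da:
            va, vb = da[k], db[k]
            if isinstance(va, float) and isinstance(vb, float):
                if not math.isclose(va, vb, rel_tol=rel_tol):
                    return False
            elif va != vb:
                return False
    return True

assert stats_close([{"mean": 1.0}], [{"mean": 1.0 + 1e-12}])
assert not stats_close([{"count": 3}], [{"count": 4}])
```

The suggested test could then use assert stats_close(stats, stats2) in place of strict equality.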

Our test multipolygon actually makes a great example, due to the null space between its two parts. The difference in memory is roughly 4x for this multipolygon! (screenshot from QGIS not reproduced here)

The only problem with this approach is that it changes the shape of the masked array. While this doesn't affect the built-in stats, there are two other parts of the API that could break here:

  • If the user provides a zone_func custom statistic, the logic could break in subtle ways, for instance if a custom stat uses the array dimensions in some calculation.
  • If the user turns on the raster_out option (which outputs raster datasets based on masked), it would break outright, since that option requires a 2D array with geospatial referencing.

So we have a few options to proceed. How about this?

  • If we get an add_stats argument and split_multi_geom == True, emit a warning that the custom stats function can't assume anything about the masked array dimensions. This way it still works, but the user is aware of the limitation.
  • If we get raster_out == True and split_multi_geom == True, we should raise an exception saying that these two options are mutually incompatible.

Of course a test to assert the above behavior would also be appreciated. I can help with that if needed.
