
Multigeometries #255

Open · wants to merge 4 commits into master
Conversation

@yyu1 yyu1 commented Feb 10, 2022

Here is the code to enable the "split_multi_geom" option for zonal_stats, as we discussed. It reads the individual polygons within a multipolygon geometry one at a time to reduce memory usage.

I only did a bare minimum of testing against the slope.tif and multipolygon.shp files in the test directory. I don't have the data or the time to do more extensive testing, so perhaps some of the users asking for this feature can run their own tests.

sgoodm (Contributor) commented Feb 10, 2022

In case it is useful for future users looking to address memory issues, I submitted a PR a while back that takes this concept one step further: it can split polygons into smaller sub-polygons to reduce memory usage, then aggregates the stats back onto the original polygons at the end.

#154

My fork with this feature is a bit outdated and includes some other features required for my application, but if needed it is available here: https://github.com/aiddata/python-rasterstats

perrygeo (Owner) left a comment


Thanks @yyu1

I'd like to see some tests for this functionality before merging. Specifically, we need a test that exercises the new code branch and asserts that the results for a multipolygon are equal regardless of the split_multi_geom setting.

Some additional comments below

src/rasterstats/main.py:
fsrc = np.append(fsrc, polygon_fsrc.array)
rv_array = np.append(rv_array, polygon_rv_array)
isnodata = np.append(isnodata, polygon_isnodata)
masked = np.ma.MaskedArray(fsrc, mask=(isnodata | ~rv_array))
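For context, the snippet above accumulates per-polygon pieces into flat 1-D arrays and builds the combined mask at the end. A minimal, self-contained NumPy sketch of that pattern (the array values here are illustrative, not from the test data):

```python
import numpy as np

# Flat accumulators, as in the PR's per-polygon loop.
fsrc = np.array([], dtype=float)
rv_array = np.array([], dtype=bool)
isnodata = np.array([], dtype=bool)

# Two hypothetical per-polygon pieces: (values, inside-polygon flags, nodata flags).
pieces = [
    (np.array([1.0, 2.0]), np.array([True, True]), np.array([False, False])),
    (np.array([3.0, 4.0]), np.array([True, False]), np.array([False, True])),
]
for p_fsrc, p_rv, p_nodata in pieces:
    fsrc = np.append(fsrc, p_fsrc)
    rv_array = np.append(rv_array, p_rv)
    isnodata = np.append(isnodata, p_nodata)

# Mask out pixels that are nodata or fall outside the rasterized polygons.
masked = np.ma.MaskedArray(fsrc, mask=(isnodata | ~rv_array))
assert masked.compressed().tolist() == [1.0, 2.0, 3.0]
```

Note the result is a 1-D array: stats like mean or count are unaffected, but the original 2-D window shape is lost, which is relevant to the API concerns raised later in this thread.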
perrygeo (Owner):

I'm a bit confused as to how this saves any memory. My interpretation is this code is an alternate method for creating the masked array, but in both cases they will be same size regardless of the split_multi_geom value. In other words, where do the memory savings actually occur?

yyu1 (Author):

This will be easier to explain with a diagram, but it seems I can't upload a picture here, so let me try with a description.

I'll use the example of a raster with 4 polygons that are each the size of 1 pixel, to make this easier to imagine.

Let's say the raster is a 10 x 10 raster, with the 4 polygons falling on the pixels (0,0), (9,0), (0,9), and (9,9), so each polygon sits at a corner of the raster. Even though the polygons only cover 4 pixels in total, geom_bounds = tuple(geom.bounds) will return the bounds of the whole multipolygon, i.e. the rectangle spanning (0,0) to (9,9), because bounds is the smallest rectangle that contains all 4 polygons. Then rast.read(bounds=geom_bounds, boundless=boundless) is going to read the entire 10x10 raster into a 10x10 array, and it will also create a 10x10 mask array. This ends up using 2x (the size of the raster) of memory.

If you break the multipolygon into 4 separate polygons, then each call to geom_bounds = tuple(geom.bounds) returns a window of just 1 pixel. You then only need 4x (1 pixel) of memory to hold the raster data for the polygons, and another 4x (1 pixel) to hold the mask.
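The bounds arithmetic above can be sketched in plain Python (the helper names bounds and window_pixels are hypothetical, just for illustration; rasterstats itself gets bounds from shapely geometries):

```python
# A 10x10 raster with four 1-pixel polygons at its corners,
# each polygon given as a list of (col, row) pixel coordinates.
def bounds(coords):
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))

def window_pixels(b):
    # Pixels a boundless read of this bounding window would load.
    xmin, ymin, xmax, ymax = b
    return (xmax - xmin + 1) * (ymax - ymin + 1)

polygons = [[(0, 0)], [(9, 0)], [(0, 9)], [(9, 9)]]  # one pixel each

# Reading the multipolygon as a whole: bounds span the full raster.
combined = bounds([pt for poly in polygons for pt in poly])
assert window_pixels(combined) == 100  # entire 10x10 window in memory

# Splitting first: four separate 1-pixel windows.
split_total = sum(window_pixels(bounds(poly)) for poly in polygons)
assert split_total == 4
```

The 100-vs-4 pixel difference (doubled, since a mask array of the same shape is allocated too) is the memory saving the split option targets.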

perrygeo (Owner):

I finally got a chance to look at this more closely and I see the optimization now, thanks for the explanation @yyu1!

I'd suggest adding a test to make sure the results are equal. This almost works; there are some floating-point differences that you'll have to account for, but otherwise they are equivalent:

def test_multipolygons_split():
    multipolygons = os.path.join(DATA, 'multipolygons.shp')
    stats = zonal_stats(multipolygons, raster, stats="*")
    stats2 = zonal_stats(multipolygons, raster, stats="*", split_multi_geom=True)
    assert stats == stats2
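One way to account for the floating-point differences would be a tolerant comparison helper along these lines (a sketch, not part of rasterstats; the stats_close name and tolerance are assumptions):

```python
import math

def stats_close(a, b, rel_tol=1e-9):
    # Compare two lists of per-feature stats dicts, tolerating
    # floating-point drift in numeric values.
    if len(a) != len(b):
        return False
    for da, db in zip(a, b):
        if da.keys() != db.keys():
            return False
        for k in da:
            va, vb = da[k], db[k]
            if isinstance(va, float) and isinstance(vb, float):
                if not math.isclose(va, vb, rel_tol=rel_tol):
                    return False
            elif va != vb:
                return False
    return True

assert stats_close([{"mean": 1.0}], [{"mean": 1.0 + 1e-12}])
assert not stats_close([{"count": 3}], [{"count": 4}])
```

The suggested test could then use assert stats_close(stats, stats2) in place of strict equality.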

Our test multipolygon actually makes a great example, due to the null space between its two parts. The difference in memory is roughly 4x for this multipolygon! (screenshot from QGIS not reproduced here)

The only problem with this approach is that it changes the shape of the masked array. While this doesn't affect the built-in stats, there are two other parts of the API that could break here:

  • If the user provides a zone_func custom statistic, the logic could break in subtle ways, for instance if a custom stat uses the array dimensions in some calculation.
  • If the user turns on the raster_out option (which outputs raster datasets based on masked), it would break outright, since that option requires a 2D array with geospatial referencing.

So we have a few options to proceed. How about this?

  • If we get an add_stats argument and split_multi_geom == True, emit a warning that the custom stats function can't assume anything about the masked array dimensions. This way it still works, but the user is aware of the limitation.
  • If we get raster_out == True and split_multi_geom == True, we should raise an exception saying that these two options are mutually incompatible.

Of course a test to assert the above behavior would also be appreciated. I can help with that if needed.
