
Add Polygon support #819

Closed · wants to merge 6 commits from the polygon_support branch

Conversation

@jonmmease (Collaborator)

Overview

This PR adds support for rasterizing polygons.

Closes #181.

For example usage, see notebook at https://anaconda.org/jonmmease/datashader_polygons_pr/notebook

GeomArray ExtensionArrays

In order to rasterize polygons efficiently, we need a data structure that can store an array of polygon definitions in a form that is directly accessible to numba.

To accomplish this, I added a new RaggedArray (see #687) subclass called PolygonsArray. Each element of this array can store one or more polygons with holes, so elements of a PolygonsArray are roughly equivalent to a shapely Polygon or MultiPolygon, and the entire PolygonsArray is roughly equivalent to a geopandas GeometryArray of Polygon/MultiPolygon elements.

The new PolygonsArray pandas extension array could eventually grow to support many of the operations supported by the geopandas GeoSeries. The advantage would be that these operations could be implemented in vectorized form using numba for potentially improved performance, and Dask DataFrame support would largely come for free.

To demonstrate the potential, I added length and area properties to the PolygonsArray class. These operations are ~8x faster than the equivalent GeoSeries operations, and they could also be naturally parallelized using Dask for large datasets.
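
A rough usage sketch (the import path is assumed from the geom module added in the diff below; from_geopandas is the conversion method referenced later in this thread):

import geopandas as gpd
from datashader.geom import PolygonsArray  # assumed import path

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
polygons = PolygonsArray.from_geopandas(world.geometry)  # GEOS geometries -> numba-accessible ragged buffers

polygons.length  # vectorized numba implementation
polygons.area    # ~8x faster than the equivalent GeoSeries operations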

Canvas methods

A new Canvas.polygons() method has been added to rasterize polygons, and the Canvas.line() method has been updated to support the new geometry arrays, making it easy to draw polygon outlines.
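
A hypothetical sketch of the two calls (keyword names are assumptions, continuing from the PolygonsArray sketch above):

import datashader as ds
import pandas as pd

df = pd.DataFrame({'geometry': polygons})  # PolygonsArray column

cvs = ds.Canvas(plot_width=650, plot_height=400)
filled = cvs.polygons(df, geometry='geometry', agg=ds.count())   # rasterize filled polygons
outline = cvs.line(df, geometry='geometry', agg=ds.count())      # polygon outlines via the updated line()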

Examples

For code and timing, see https://anaconda.org/jonmmease/datashader_polygons_pr/notebook

[images: texas, world, and world_outline renderings]

cc @jbednar @philippjfr

 - New PolygonsArray/LinesArray/PointsArray extension arrays
 - New Canvas.polygons() method to rasterize polygons
 - Updates to Canvas.points and Canvas.lines to accept new geometry arrays.
@philippjfr (Member)

This looks great! Particularly glad to see the conversion functions from geopandas. My main question is about all the extension array work and how that overlaps with the work that is going on within geopandas itself. Do we want to continue maintaining these extension arrays? Do we push to merge the implementations with geopandas? Do we work with a project like awkward-array to get a shared implementation across all the different projects?

@jbednar (Member) commented Oct 23, 2019

Do we push to merge the implementations with geopandas?

This work is currently living in Datashader, but one of the motivations was to have (somewhere, not sure yet!) an implementation of Shapely-like processing in Dask and Pandas that is fully independent of the geospatial stack that GeoPandas uses, i.e. Shapely, GEOS, GDAL, etc. Being independent of that stack makes it much more flexible and easier to deploy on a variety of systems, and so I don't think we should be trying to move this code into GeoPandas itself. That said, it would be possible to abstract some of it into something that this code and GeoPandas can both depend on, but that doesn't rely on the legacy stack. I don't know how feasible that is, but it does sound desirable!

In the meantime, we'll have the limitation that we don't have a standard way to persist these objects, which means that for I/O we are still tied to the legacy stack. :-(

@jbednar (Member) commented Oct 23, 2019

As for awkward-array, it definitely looks interesting and having it be externally maintained is very attractive, but I haven't yet seen anything significant it would add for these use cases. E.g. I don't see any persistence format in https://github.com/scikit-hep/awkward-1.0 yet, and the operations to be supported here don't seem to have much overlap with those proposed for awkward...

@philippjfr (Member) commented Oct 23, 2019

That said, it would be possible to abstract some of it into something that this code and GeoPandas can both depend on, but that doesn't rely on the legacy stack.

Yes, that was my hope, duplicating these things all over the place does not seem ideal (even if it's fine for the time being).

In the meantime, we'll have the limitation that we don't have a standard way to persist these objects, which means that for I/O we are still tied to the legacy stack. :-(

Right, there are quite a few unsolved issues, which is why I would like to push towards some shared implementation that solves all of them, and awkward-array seems like the solution most likely to get there, particularly if we encourage it in some form. It has a community of hundreds of folks who will be relying on it for their daily work.

I don't see any persistence format in https://github.com/scikit-hep/awkward-1.0

Quoting the README: "Interoperability with ROOT, Arrow, Parquet and Pandas, as well as planned interoperability with Numba and Dask.", with parquet and arrow both being persistence formats. In the longer term goals it also states: "Translation to and from Apache Arrow and Parquet in C++."

the operations to be supported here don't seem to have much overlap with those proposed for awkward...

This is absolutely true but the idea as I understood it is that it would allow implementing numba CPU and GPU kernels on top of the arrays, which could then be exposed as functions or on custom array subclasses.

@jbednar (Member) commented Oct 23, 2019

Interoperability with...

At this point those are all just plans, though. E.g. they mention Parquet export, but Parquet doesn't have any support for ragged arrays that I know of, so presumably they only mean export of non-ragged arrays to Parquet. Extending Parquet (via Arrow/Fletcher?) to support ragged data is a separate task and necessary with or without Awkward, as far as I can see.

would allow implementing numba CPU and GPU kernels

How would that differ from what is already implemented here? Doesn't this ExtensionArray already support those?

@philippjfr (Member)

Yes, I think this would definitely be a longer term plan.

How would that differ from what is already implemented here? Doesn't this ExtensionArray already support those?

Good question I don't have an answer to.

@philippjfr (Member)

so presumably they only mean export of non-ragged arrays to Parquet.

This seems very unlikely to me since almost all their data has ragged arrays in some form, so parquet support without that would be entirely useless for them.

@jbednar (Member) commented Oct 23, 2019

parquet support without that would be entirely useless for them.

Maybe that's a question you could raise with that community, because Parquet is very explicitly a columnar storage format (it's the first line on their homepage!), and a Google search for parquet+ragged only comes up with awkward, not anything in Parquet.

@philippjfr (Member) commented Oct 23, 2019

Worked fine in the old awkward already so I don't see why it wouldn't work in the new one. In the end ragged arrays are just stored as columns with offsets to compute the start and stop of each ragged chunk:

[screenshot: a ragged column stored in Parquet as flat values plus offsets]
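
A minimal pyarrow sketch of that layout (illustrative, not from the original screenshot): a ragged column is stored as one flat values array plus an offsets array.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

ragged = pa.array([[0.75, 0.60, 0.16], [0.02, 0.87], [0.42, 0.62, 0.81]],
                  type=pa.list_(pa.float64()))
pq.write_table(pa.Table.from_arrays([ragged], names=['col']), 'test.parq')
pd.read_parquet('test.parq')  # the ragged column round-trips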

@jbednar (Member) commented Oct 23, 2019

In the end ragged arrays are just stored as columns with offsets to compute the start and stop of each ragged chunk:

Hmm. I can't tell if that would work here or not. What we need is columnar, but with one chunk per row, and each row lining up with other Pandas columns containing e.g. one scalar per row. I can see how this column works, but not how it maps to other columns in the same Parquet file.

@jbednar (Member) commented Oct 23, 2019

Looks like it does come back organized that way when your file above is read into Pandas, resulting in an object column:

>>> pd.read_parquet("test.parq")
                                                    
0  [0.7556205061410939, 0.6017337297767158, 0.162...
1  [0.021352689589037355, 0.8700607878641867, 0.2...
2  [0.42894486202061544, 0.6228601294191625, 0.81...

So maybe it will work; in our case we'd just want to restore the extension array rather than get an object column.

@jonmmease (Collaborator, Author)

Cross reference #720 (comment) regarding extension array serialization.

For the geometry constructs, I think we only really need one level of nesting, which is what the current RaggedArray supports. This maps 1-to-1 with the arrow ListArray type, which already has parquet serialization support through pyarrow. So serialization to parquet should already be possible. Once the pandas and pyarrow teams settle on metadata conventions, it should also be possible to read a parquet file into a pandas DataFrame and automatically restore the extension array columns.

@philippjfr (Member)

Great to hear!

@jbednar (Member) left a review comment

Looks good to me, at the code level. Let's have an audio chat to go over the approach and open issues, so that I can be sure I understand the detailed implications of the implementation (e.g. what happens to a polygon that gets partially aggregated into multiple pixels), the implications of using paired coordinates, possible storage formats, detecting that this special Inf convention is in use, etc.

@@ -11,7 +11,7 @@
from .pipeline import Pipeline # noqa (API import)
from . import transfer_functions as tf # noqa (API import)
from . import data_libraries # noqa (API import)

from . import geom # noqa (API import)
Review comment on the diff:
The module name geom could be confusing here, as we have both spatial and geo already. The contents of geo.py should probably be moved into the other two, so let's ignore that one. spatial is full of functions for working with aggregate arrays, whereas geom is full of functions for doing the aggregations. geom vs. spatial doesn't convey that distinction to me. I can't immediately think of a better name, though.

@philippjfr (Member)

Btw, I'm totally not objecting to the idea that this implementation is the one we continue to use internally. I'm just saying that it'd be a shame if it were used exclusively by datashader while lots of other people end up duplicating the work. So I'm suggesting that we should either move it out of datashader eventually OR switch to another implementation (but only if it makes sense to do so).

@jbednar (Member) commented Oct 23, 2019

Oh, I definitely don't want the underlying ragged array support to stay in Datashader, so we're on the same page there. It's here until we figure out what to do with it, to avoid committing to a particular implementation until we are really ready for that. The ideal is if it can migrate to a suitable project already maintained by others, and barring that I think it would need to be in its own project that can have a life outside of Datashader.

@jbednar (Member) commented Oct 23, 2019

For the geometry constructs, I think we only really need one level of nesting, which is what the current RaggedArray supports. This maps 1-to-1 with the arrow ListArray type, which already has parquet serialization support through pyarrow.

That's a really good reason to stick with the +Inf/-Inf convention that lets us use flat arrays rather than lists of arrays. Good thing we went with that; whew!

@jonmmease (Collaborator, Author)

Yeah, I agree that it would make sense to eventually move these geometry focused extension arrays into a separate package. But, I'm not familiar enough with the communities around GeoPandas to know whether there are any existing packages that would be a good fit.

@jbednar (Member) commented Oct 23, 2019

@jonmmease , can you add an example above of plotting the same data as filled polygons, polygon outlines, and isolated points? The first two are illustrated, but not the third.

@jbednar (Member) commented Oct 23, 2019

I'm not familiar enough with the communities around GeoPandas to know whether there are any existing packages that would be a good fit.

I'm pretty sure @jorisvandenbossche will be able to help guide us in that...

@jonmmease (Collaborator, Author)

can you add an example above of plotting the same data as filled polygons, polygon outlines, and isolated points?

Sure, I updated the notebook at https://anaconda.org/jonmmease/datashader_polygons_pr/notebook with examples of lines and points.

@jbednar (Member) commented Oct 24, 2019

Perfect, thanks! I figured there might need to be some small fix to support that. :-)

jonmmease and others added 3 commits October 24, 2019 18:24
Co-Authored-By: James A. Bednar <jbednar@users.noreply.github.com>
@jorisvandenbossche

Thanks for pinging me!
(first answering on some other aspects of the discussion above)

In the meantime, we'll have the limitation that we don't have a standard way to persist these objects, which means that for I/O we are still tied to the legacy stack. :-(

It already has become clear in the above discussion I think, but to be explicit: parquet notably does support ragged arrays (or at least the kind of ragged arrays we are dealing with here with a regular nesting). In "parquet parlance" it is called "nested" data types, and specifically a "list" type.
Arrow also natively supports this kind of data (ListArray / ListType), and thus arrow + parquet can be a natural choice for a good serialization format of this kind of data.

As I also mentioned on one of the other issues, I think this is a good reason to choose a memory format that maps to one of arrow's types (another reason is that this then will directly work on the GPU as well, as RAPIDS is using arrow memory).
But @jonmmease already mentioned this is the case right now with the RaggedArray, so that is great.

Note that it would actually be possible not to use the +inf/-inf approach, but to use a second level of nesting like a "list of list of float array" (the data itself stays a flat array, but you get an extra set of offsets), which still maps perfectly to arrow/parquet types. Now, I don't know the code that uses this format for generating the glyphs, and it might be that the inf approach gives better performance.

So serialization to parquet should already be possible. Once the pandas and pyarrow teams settle on metadata conventions, it should also be possible to read a parquet file into a pandas DataFrame and automatically restore the extension array columns.

Yes, that is something I am actively working on right now, and normally, with the next arrow version, it should be possible to have a perfect roundtrip to arrow/parquet for a custom ExtensionArray.
The pandas -> arrow/parquet direction already works if you implement ExtensionArray.__arrow_array__ on the custom extension arrays here. The reverse will (probably; this is still an open PR) work by implementing an ExtensionDtype.__from_arrow__. With those two, a full roundtrip preserving the types should be possible.
See pandas-dev/pandas#20612 (comment) for the pandas issue, and https://issues.apache.org/jira/browse/ARROW-2428 / apache/arrow#5512 for the Arrow work-in-progress. Feedback on that is very welcome! (as it is exactly this kind of applications that this new work should enable)
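
A schematic of those two hooks (illustrative standalone classes; a real implementation subclasses pandas' ExtensionArray/ExtensionDtype):

import numpy as np
import pyarrow as pa

class RaggedArray:
    def __init__(self, flat_values, offsets):
        self.flat_values = np.asarray(flat_values, dtype='float64')
        self.offsets = np.asarray(offsets, dtype='int32')

    def __arrow_array__(self, type=None):
        # pandas -> arrow: expose the flat buffer plus offsets as a ListArray
        return pa.ListArray.from_arrays(self.offsets, self.flat_values)

class RaggedDtype:
    def __from_arrow__(self, arr):
        # arrow -> pandas: rebuild the extension array from the ListArray's buffers
        offsets = np.frombuffer(arr.buffers()[1], dtype='int32')[:len(arr) + 1]
        return RaggedArray(arr.values.to_numpy(), offsets)

ra = RaggedArray([1., 2., 3., 4., 5.], [0, 3, 5])
roundtrip = RaggedDtype().__from_arrow__(ra.__arrow_array__())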

@jorisvandenbossche

(and then some comments regarding the geospatial aspects)

I'm not familiar enough with the communities around GeoPandas to know whether there are any existing packages that would be a good fit.

I am also not aware of any existing packages in this space in Python. But, I think it would certainly be useful to have this functionality somewhere (and not semi-hidden in datashader). Where the best place would be, not directly sure, will think a bit more about this (short term it is of course perfectly fine to have it here).

At the moment, I think that GeoPandas would rather keep its focus on geometries in the sense of what is supported by GEOS, and not necessarily on representations based on ragged arrays.
On the other hand, if exposing such functionality, it would be nice to have good integration with, and a similar interface to, geopandas (would it e.g. be possible to have a GeoDataFrame with geopandas' GeometryArray swapped out for a PolygonArray? That could be an interesting option to explore, where the new array types could still live in a side project).

Regarding performance:

The new PolygonsArray pandas extension array could eventually grow to support many of the operations supported by the geopandas GeoSeries. The advantage would be that these operations could be implemented in vectorized form using numba for potentially improved performance, and Dask DataFrame support would largely come for free.

To demonstrate the potential, I added length and area properties to the PolygonsArray class. These operations are ~8x faster than the equivalent GeoSeries operations, and they could also be naturally parallelized using Dask for large datasets.

We are working on better performance in GeoPandas in the PyGEOS project (https://github.com/pygeos/pygeos/), so for comparisons, that is also an interesting target. I took some comparisons from Jon's notebook and redid the comparison of the geopandas method but now with pygeos:

In [18]: import pygeos

In [19]: arr = pygeos.from_wkb(geopandas.array.to_wkb(world.geometry.array))

In [20]: length_gp = %timeit -o world.geometry.array.length 
696 µs ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [21]: length_pg = %timeit -o pygeos.length(arr)
156 µs ± 6.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [22]: print("pygeos.length speedup: %.2f" % (length_gp.average / length_pg.average))
pygeos.length speedup: 4.47

In [23]: area_gp = %timeit -o world.geometry.array.area 
709 µs ± 55.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [24]: area_pg = %timeit -o pygeos.area(arr)  
118 µs ± 1.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [25]: print("pygeos.area speedup: %.2f" % (area_gp.average / area_pg.average))
pygeos.area speedup: 6.01

I didn't compare with this PR on my laptop, but compared to geopandas there is also a considerable speed-up with pygeos. The numba solution based on the ragged-array representation in this PR still gives a somewhat bigger speed-up, but with pygeos the difference gets much smaller.
Now that said, the implementation in this PR could maybe be further optimized (further optimizing pygeos means optimizing GEOS C++, which is a harder target). And there will be cases where a custom implementation can certainly be a lot faster than GEOS (distances between points is a notable one). And of course there are also other reasons to want an array-based, non-GEOS representation of geometries (such as the specific use case here in datashader, avoiding dependencies, ...). Just don't underestimate the complexity of much of the functionality that GEOS now provides for us.

For those conversions between GEOS-based geopandas and ragged-array based ExtensionArrays here: to have the best possible performance for those conversions, it might be useful to see if certain aspects can be implemented in PyGEOS.
In a quick experiment, I check getting the flat numpy array of coordinates from the "countries" dataset (the example used in Jon's notebook): when using pygeos instead of the PolygonsArray.from_geopandas method of this PR, the time goes down from 45ms to 150µs (factor 300). See https://nbviewer.jupyter.org/gist/jorisvandenbossche/de8bfd1c1a7f62e25fd6f29bd9c3c0e3. Now to be clear, the pygeos method I used only gives you the coordinates, not the array of indices. But at least it shows a faster method is possible (I think also returning an array of indices in pygeos should only give a limited slowdown).

@jonmmease (Collaborator, Author)

Thanks for taking the time to weigh in @jorisvandenbossche!

Note that it would actually be possible not to use the +inf/-inf approach, but to use a second level of nesting like a "list of list of float array"

Oh cool, I didn't realize that you could have multiple nested levels like that. The way things work right now is that the +inf/-inf values are used to indicate whether the polygon that follows is in counterclockwise (filled) or clockwise (hole) order. I'll think a bit on whether there are any particular pros/cons to these representations.

Yes, that is something I am actively working on right now...

Awesome, thanks for taking that on!

We are working on better performance in GeoPandas in the PyGEOS project...

Yes, I saw your announcement for PyGEOS and that looks like a great step forward in terms of performance. And based on your benchmarking, it looks like working at the level of PyGEOS for the geopandas <-> ragged conversions could be very beneficial.

Now that said, the implementation in this PR could maybe be further optimized

By parallelizing the numba implementation of area/length I was able to get a significant speedup over the single threaded version timed above. Do you know if it would be possible to run the PyGEOS ufuncs in a multi-threaded context?

I am also not aware of any existing packages in this space in Python. But, I think it would certainly be useful to have this functionality somewhere (and not semi-hidden in datashader).

@jbednar @jorisvandenbossche Here are some notes I started on what a new project might look like that extracts and generalizes the geometric/spatial concepts that Datashader needs. There would certainly be a long tail to implement the various GEOS features on the ragged representation in numba, but I expect it could be useful before everything is fully fleshed out. Let me know if you have any other general thoughts on this approach and how it fits in the ecosystem.

holoviz/spatialpandas#1

Thanks!

@jorisvandenbossche

The way things work right now is that the +inf/-inf values are used to indicate whether the polygon that follows is in counterclockwise (filled) or clockwise (hole) order. I'll think a bit on whether there are any particular pros/cons to these representations.

I thought the np.inf was used to mark the beginning of a new part of a MultiLineString/MultiPolygon, and the -np.inf to mark the beginning of the interiors (holes) of a single Polygon (in shapely terminology)? But writing this, I suppose that comes down to the same thing in the end (if the winding of your shapely polygons / input is correct).

It could be similar to how e.g. GeoJSON does it with nested lists: inside the list for a single geometry, the first list is the exterior, and any following lists are holes, if present. For a MultiPolygon, you then get a list of such nested lists. You get a lot of (virtual) nesting (but the actual coordinates are still just a flat array), where you have offsets to the beginning of each MultiPolygon, then offsets to the beginning of each polygon inside that, and then offsets to the parts (exterior / holes).
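
An illustrative pyarrow construction of that doubly nested layout (offsets invented for a single square with one hole; x,y interleaved so the innermost values stay one flat array):

import pyarrow as pa

coords = pa.array([0., 0., 1., 0., 1., 1., 0., 1., 0., 0.,      # exterior ring
                   .2, .2, .2, .6, .6, .6, .6, .2, .2, .2])     # hole
rings = pa.ListArray.from_arrays([0, 10, 20], coords)           # ring -> coordinates
polygons = pa.ListArray.from_arrays([0, 2], rings)              # polygon -> [exterior, holes...]
multipolygons = pa.ListArray.from_arrays([0, 1], polygons)      # array element -> polygons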

Do you know if it would be possible to run the PyGEOS ufuncs in a multi-threaded context?

Yes, that is possible. There is still some work to get there (releasing the GIL; I have a POC branch that does this for some operations), but I did a quick experiment on it: https://jorisvandenbossche.github.io/talks/2019_FOSS4GBE_pygeos/#32

Here are some notes I started on what a new project might look like that extracts and generalizes the geometric/spatial concepts that Datashader needs.

Thanks! I will try to comment on that issue shortly.

@jonmmease (Collaborator, Author)

I thought the np.inf was used to mark the beginning of a new part of a MultiLineString/MultiPolygon, and the -np.inf to mark the beginning of the interiors (holes) of a single Polygon (in shapely terminology)? But writing this, I suppose that comes down to the same thing in the end (if the winding of your shapely polygons / input is correct).

Yeah, that's correct. In this PR I'm using the winding-number fill algorithm and assuming that the main polygon is CCW and holes are CW. I run the shapely polygons through shapely.geometry.polygon.orient during import to make sure this is true. It occurs to me that this might be part of why the import is a bit slow.
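
For reference, orient is standard shapely; sign=1.0 normalizes to a CCW exterior with CW holes:

from shapely.geometry import Polygon
from shapely.geometry.polygon import orient

p = Polygon([(0, 0), (0, 1), (1, 1), (1, 0)],                   # CW exterior
            holes=[[(.2, .2), (.6, .2), (.6, .6), (.2, .6)]])   # CCW hole
fixed = orient(p, sign=1.0)
fixed.exterior.is_ccw  # True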

It could be similar to how e.g. GeoJSON does it with nested lists: inside the list for a single geometry, the first list is the exterior, and any following lists are holes, if present. For a MultiPolygon, you then get a list of such nested lists. You get a lot of (virtual) nesting (but the actual coordinates are still just a flat array), where you have offsets to the beginning of each MultiPolygon, then offsets to the beginning of each polygon inside that, and then offsets to the parts (exterior / holes).

For this approach, are you picturing that the extension array itself would store the polygons and holes as nested Arrow ListArrays, rather than numpy arrays? Or just that this format could be used for parquet serialization? In either case, it's not clear to me how to get the offsets back out of a pyarrow ListArray. For example:

import pyarrow as pa
la = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5])
la
<pyarrow.lib.ListArray object at 0x7f3535fbcbe8>
[
  [
    1,
    2,
    3
  ],
  [
    4,
    5
  ]
]

There's a values property to get the flat array out.

la.values
<pyarrow.lib.Int64Array object at 0x7f3535fbcb28>
[
  1,
  2,
  3,
  4,
  5
]

But I don't see how to get the [0, 3, 5] offsets array back out. Looking quickly at the pyarrow source, it looks like maybe this isn't implemented in Python yet.

https://github.com/apache/arrow/blob/89b5a2484161c9cab7bb33d899d696729f119126/python/pyarrow/array.pxi#L1200

Apart from this implementation question, it would be pretty compelling if we could come up with a canonical ragged representation of homogeneous geometry arrays, and if PyGEOS could efficiently create these arrays directly from geopandas geometry arrays.

@jonmmease (Collaborator, Author)

But I don't see how to get the [0, 3, 5] offsets array back out

Ahh, never mind. I see that I can get at it through the chunk's buffers.
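
For anyone following along, a sketch of that buffer route (buffer 0 is the validity bitmap, buffer 1 the int32 offsets; the buffer is padded, so slice to the logical length):

import numpy as np
import pyarrow as pa

la = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5])
offsets = np.frombuffer(la.buffers()[1], dtype=np.int32)[:len(la) + 1]
offsets  # array([0, 3, 5], dtype=int32)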

@jorisvandenbossche

I run the shapely polygons through shapely.geometry.polygon.orient during import to make sure this is true. It occurs to me that this might be part of why the import is a bit slow.

Ah, yes, indeed. From a quick profile, it seems that two-thirds of the time is spent in orient, so that part was not included in my pygeos timing. From a quick look at the source of orient, a pygeos version can probably be faster as well. Opened an issue for that here: pygeos/pygeos#71

For this approach, are you picturing that the extension array itself would store the polygons and holes as nested Arrow ListArrays, rather than numpy arrays? Or just that this format could be used for parquet serialization?

Yes, just that it can be used for serialization, if you use numpy arrays with the same layout (so without depending directly on pyarrow). Eg if we store the arrays here in a PolygonArray, we could easily construct the arrow array with pa.ListArray.from_arrays when doing serialization.

But I don't see how to get the [0, 3, 5] offsets array back out

Ahh, never mind. I see that I can get at it through the chunk's buffers.

Yes, but that is not very user-friendly, I think. We should probably improve the interface in pyarrow to give easier access to the offsets. I opened https://issues.apache.org/jira/browse/ARROW-7031 for this.

@jonmmease (Collaborator, Author)

Thanks for opening those issues @jorisvandenbossche!

Yes, just that it can be used for serialization, if you use numpy arrays with the same layout (so without depending directly on pyarrow). Eg if we store the arrays here in a PolygonArray, we could easily construct the arrow array with pa.ListArray.from_arrays when doing serialization.

Ok, yeah. That makes sense. I've also been studying @xhochy's fletcher pyarrow extension array, and I'm wondering if it would make sense to ditch our RaggedArray class altogether and have the geometry extension arrays subclass FletcherArray. Basically, the geometry array subclasses would just perform some extra validation in the constructor to make sure the input arrays have the right number of nested levels and a reasonable inner element type. The nice thing is that this would remove most of the pandas extension array logic / testing.

It would make the polygon logic here dependent on fletcher and pyarrow, so that would need to be considered. @jbednar do you have any opinion on the code reuse / dependency tradeoff in this case?

@jbednar (Member) commented Oct 30, 2019

@jonmmease , I'll leave that to your discretion; I don't have any inside knowledge of it other than that I believe @mrocklin proposed using fletcher originally for this purpose. I don't see any problematic dependencies...

@jorisvandenbossche

I don't know to what extent you want pyarrow as a hard dependency for this functionality; it is still quite a big dependency.
Although I am somewhat biased towards using pyarrow, I am not sure there is that much added value in using it directly yet, since there is not much pyarrow functionality you would use (except for storing the arrays).

In the end, for much of the functionality you implement in datashader, you need access to the actual data (flat array + offsets): eg looping through the inner lists in numba to create glyphs.
So whether you store the array + offsets yourself in an ExtensionArray, or directly in a pyarrow.ListArray (wrapped in an ExtensionArray), for the code using this extension array in datashader, it doesn't matter much I think.

Of course, directly using pyarrow might make the implementation of the PolygonArray a bit simpler.
And also, most people will probably want to use this with good serialization trough eg parquet, and in that case they need to have pyarrow installed anyway.

@mrocklin

I believe @mrocklin proposed using fletcher originally for this purpose

Fletcher does text things. Perhaps I was referring to using Pandas Extension Arrays (of which fletcher is an example)?

@jorisvandenbossche

In addition to the text functionality, fletcher also has a base FletcherArray (an ExtensionArray that wraps a pyarrow Array) that could be used as a base class for a LineArray/PolygonArray/...

Now, I am not familiar enough with fletcher's development to know whether it is already stable enough to use as a dependency.

@jonmmease (Collaborator, Author)

Now, I am not familiar enough with fletcher's development to know whether it is already stable enough to use as a dependency.

@xhochy, if you have a moment to chime in. At this point in time, do you have an opinion on whether you would generally encourage or discourage a library from subclassing FletcherArray as a way to create a custom pandas extension array that maps to an arrow type (e.g. ListArray in this case)? Thanks!

@sdc50 commented Oct 30, 2019

I'm very excited about this work. I was playing around with applying this to a project that fell by the wayside a couple of years ago. Here are some initial results:

[image: timing and rendering results for the 800K-polygon dataset]

This dataset has just over 800K polygons. As you can see, converting the dataset from GeoPandas takes a significant amount of time. @jonmmease, you mentioned that this may be because of running the geometries through shapely to ensure the orientation is correct. I wonder if that check could be made optional and turned off for data that has been preprocessed. I'm less familiar with the other formats being discussed, but if a faster IO approach could be supported, that would be great!

@jonmmease (Collaborator, Author)

Hi @sdc50, thanks for chiming in and sharing what you're seeing. Yeah, I think we could have something like an orient kwarg to from_geopandas. And I suspect we'll be able to get this import time down a fair bit more if we work out how to rely on PyGEOS for the import.

Apart from data import time, what are your impressions of the rendered result and the performance? Do you have anything to compare it to? Do the speckled locations toward the lower left make sense?

Also, you should be able to parallelize the aggregation step using Dask if you build a Dask DataFrame from your initial pandas DataFrame using dask.dataframe.from_pandas.
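
A minimal sketch of that (df being the pandas DataFrame holding the geometry column; the Canvas keyword names are assumptions, as above):

import dask.dataframe as dd
import datashader as ds

ddf = dd.from_pandas(df, npartitions=8)  # partition so aggregation can run in parallel
cvs = ds.Canvas(plot_width=650, plot_height=400)
agg = cvs.polygons(ddf, geometry='geometry', agg=ds.count())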

Thanks!

@jbednar (Member) commented Oct 30, 2019

I don't know to what extent you want pyarrow as a hard dependency for this functionality; it is still quite a big dependency.

To make that concrete, here are the additional conda packages that get installed when you make a datashader environment with pyarrow, beyond what you get with datashader itself:

arrow-cpp
boost-cpp
brotli
bzip2
double-conversion
gflags
glog
icu
libboost
libevent
libiconv
libprotobuf
lz4-c
pyarrow
re2
snappy
thrift-cpp

A few of those are no big deal, and already likely to be in a real environment anyway (snappy is used in many Datashader examples, bzip2 and lz4-c are useful for anyone reading input files, and some others look straightforward), but others do look substantial and potentially problematic. That said, only a fraction of our users need the ragged array support, and so we can certainly consider requiring pyarrow just for those users, if it increases interoperability and reduces our maintenance burden and codebase size or complexity. Jon, I think I have to leave that decision to you; please go with whichever choice you feel most comfortable with, keeping all the above in mind!

@sdc50 commented Nov 1, 2019

Apart from data import time, what are your impressions of the rendered result and the performance? Do you have anything to compare it to? Do the speckled locations toward the lower left make sense?

The rendered result looks great, and it's much faster than any other tool I've used: ArcGIS takes several minutes to visualize this dataset and has to re-render any time you do anything to the map, and rendering this in a notebook with GeoViews takes hours/days (if it doesn't crash).

The speckled appearance is accurate for the data. In fact the whole thing is much more speckled than it appears with the blue shade color map. This render helps to show how speckled the whole area is:

[image: render with a different colormap, showing the speckling across the whole area]

It would now be useful to be able to zoom in to explore the data better. Is there a way to plot this with Bokeh so the zoom/pan controls would be available? I'd also like to be able to overlay this on some map tile. I've done that with other datasets using GeoViews/HoloViews, but can't work out how to do that with this data type.

@philippjfr (Member)

I've done that with other datasets using GeoViews/HoloViews, but can't work out how to do that with this data type.

Once there is agreement that the data format for this is finalized, I'll be opening a HoloViews PR to add support for this glyph type, so you'll be able to use the usual approaches.

@jonmmease (Collaborator, Author)

@jorisvandenbossche, quick question on __arrow_array__ serialization. It looks like pyarrow doesn't currently support returning a ChunkedArray in __arrow_array__, which is what fletcher uses as the storage data structure. Would this be difficult to support at the pyarrow level?

@xhochy commented Nov 4, 2019

Thanks for taking a look at fletcher. First and foremost: I'm actively looking to develop it further and am also happy to spend some cycles making it more useful for others. Important things to note about it:

  • fletcher started out as a library for faster strings but nowadays supports any pyarrow-backed array type. My main personal use case for it will be efficient string and boolean arrays; I'm not seeing other types where I have the need at the moment.
  • I would see it as a place where we could develop general algorithms on top of pyarrow-backed ExtensionArrays. Probably we should keep it limited to general purpose algorithms but I'm happy to temporarily house geo-related algorithms if that makes development simpler.
  • The long list of dependencies of the conda-forge pyarrow package can be reduced by splitting the recipe into multiple outputs. I plan to do this at some point, but no work has been done there yet. This will only happen for the conda package, not the wheel, as pip/setuptools isn't sophisticated enough to handle this. This is meant as an FYI in case the list of dependencies becomes a problem in future.
  • Long-term I hope that most of it gets absorbed into pandas and pyarrow. Until that is the case, it is there to iterate faster on getting the missing pieces together and find out what is best to go into both of those packages.
  • At the moment only the latest "major" version (i.e. the latest 0.x release series) of pyarrow and pandas is supported, and I plan to keep it that way, as things are rapidly improving in both pyarrow and the pandas.ExtensionArray API. Supporting older versions would be quite a burden. Happy to accept PRs from the outside, but I doubt that anyone will go to the extra effort.

Also, I see that @jonmmease has started prototyping https://github.com/jonmmease/spatialpandas; I guess building this on top of fletcher seems like a good idea.

@jonmmease (Collaborator, Author)

Hi @xhochy, thanks for taking the time. That was very helpful. As I mentioned in #826 and holoviz/spatialpandas#2, after a bunch of prototyping I decided not to subclass FletcherArray for the time being. The class works great, and I got pretty far down the road using the subclass approach. I just ran into some issues that I think stem from the fact that I want __getitem__(int_key) to return instances of a custom class wrapping the arrow result. I ended up using FletcherArray as a helpful reference in writing a custom ExtensionArray that wraps a numeric ListArray. This "numeric ListArray" restriction also helped simplify the required logic a bit (e.g. no string or ChunkedArray handling). Thanks again!
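
A toy illustration of that design point (names invented, not the actual spatialpandas API): integer indexing returns a custom wrapper around the arrow element, which is awkward to arrange when inheriting FletcherArray's __getitem__.

import pyarrow as pa

class PolygonsScalar:
    """Illustrative wrapper around a single arrow list element."""
    def __init__(self, value):
        self.value = value  # a pyarrow list scalar

class ListBackedArray:
    """Illustrative array wrapping a pyarrow ListArray."""
    def __init__(self, data):
        # enforce the list-typed restriction (a real version would also
        # check that the innermost type is numeric)
        assert pa.types.is_list(data.type)
        self.data = data

    def __getitem__(self, i):
        return PolygonsScalar(self.data[i])  # element access yields the wrapper class

arr = ListBackedArray(pa.array([[0., 1.], [2., 3., 4.]], type=pa.list_(pa.float64())))
elem = arr[0]  # PolygonsScalar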

@jonmmease (Collaborator, Author)

@sdc50

The rendered result looks great, and it's much faster than any other tool I've used: ArcGIS takes several minutes to visualize this dataset and has to re-render any time you do anything to the map, and rendering this in a notebook with GeoViews takes hours/days (if it doesn't crash).

That's great news! Thanks for taking the time to try this out and report back.

This dataset has just over 800K polygons. As you can see, converting the dataset from GeoPandas takes a significant amount of time. @jonmmease, you mentioned that this may be because of running the geometries through shapely to ensure the orientation is correct. I wonder if that check could be made optional and turned off for data that has been preprocessed.

I think the best way forward here is probably to build import/export capabilities on top of (or in) the PyGEOS library. For the time being, we can at least add an option to disable the orient logic. I did this in the new extension arrays that will replace those in this PR at holoviz/spatialpandas#2.

@jonmmease (Collaborator, Author)

This PR is now superseded by #826 and holoviz/spatialpandas#2.

With this approach the extension arrays are built on nested pyarrow ListArrays as part of the new (not yet published) spatialpandas library, and there is a separate extension array corresponding to each shapely/geopandas shape class. With the multi-level nesting support, nan/+inf/-inf values are no longer needed as separators, which cleans up the Datashader rendering logic a bit.

Efficient serialization to parquet already works, and reading from parquet will be possible when such support is added to pandas.

Thanks for the discussion all. I'm going to close this PR and we can continue on #826 regarding polygon rendering and on holoviz/spatialpandas#2 regarding the geometric extension arrays.

@jonmmease jonmmease closed this Nov 5, 2019
@jbednar jbednar mentioned this pull request Nov 24, 2020
5 tasks
@maximlt maximlt deleted the polygon_support branch December 25, 2021 17:21