
Add quadmesh glyph with rectilinear and curvilinear support #779

Merged: 24 commits (Aug 28, 2019)

Conversation

@jonmmease
Collaborator
commented Aug 13, 2019

This PR is an alternative implementation of quadmesh rasterization, initially based on the logic from #769.

Unlike that PR, this PR adds quadmesh glyph classes, and supports the standard datashader aggregation framework.

Overall, the architecture fits well with that of the dataframe-based glyphs (points, line, and area). And relying on the datashader aggregation framework results in a lot less code duplication compared to #769.

Thanks to the variable argument expansion from #780, the performance for rectilinear rendering is now on par with the prototype implementation.

For curvilinear quadmesh aggregation, this PR uses a raycasting algorithm to determine which pixels to fill. I've found this approach to be ~1.5x slower than the prototype implementation, which uses an area-based point-inclusion approach. The algorithm isn't the only difference between the implementations, and I didn't exhaustively chase down the other differences this time.

I went with the raycasting algorithm because it handles concave and complex quads. It is also very straightforward to extend this algorithm to general polygons (with or without holes), so I think there's a good path here towards adding general polygon support to datashader as well.
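For reference, here is a minimal sketch of the even-odd ray-casting test (illustrative only, not the code in this PR): a horizontal ray is cast from the pixel's sample point toward +x, and the point is inside the quad if the ray crosses an odd number of edges.

import numpy as np
from numba import njit

@njit
def point_in_quad(px, py, qx, qy):
    # qx, qy are length-4 arrays of vertex coordinates for one quad.
    # Count crossings of a horizontal ray from (px, py) toward +x;
    # an odd count means the point is inside. This also handles
    # concave and self-intersecting quads.
    inside = False
    j = 3
    for i in range(4):
        # Does edge (j -> i) straddle the horizontal line y = py?
        if (qy[i] > py) != (qy[j] > py):
            # x coordinate where the edge crosses that horizontal line
            x_cross = qx[j] + (py - qy[j]) * (qx[i] - qx[j]) / (qy[i] - qy[j])
            if px < x_cross:
                inside = not inside
        j = i
    return inside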

For example usage and benchmarks, see https://anaconda.org/jonmmease/quadmeshcomparisons_pr/notebook (rendered at https://nbviewer.jupyter.org/urls/notebooks.anaconda.org/jonmmease/quadmeshcomparisons_pr/download)
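As a rough sketch of the kind of usage the notebook covers (illustrative only; the coordinate names Qx/Qy and the data below are made up), a rectilinear quadmesh is a DataArray with 1D x/y coordinates, while a curvilinear quadmesh attaches 2D coordinate arrays:

import numpy as np
import xarray as xr
import datashader as ds

xs = np.linspace(0, 10, 100)
ys = np.linspace(0, 5, 50)
zs = np.sin(xs) * np.cos(ys[:, np.newaxis])

# Rectilinear: 1D dimension coordinates
rect = xr.DataArray(zs, coords={'x': xs, 'y': ys}, dims=['y', 'x'], name='Z')

canvas = ds.Canvas(plot_width=400, plot_height=200)
rect_agg = canvas.quadmesh(rect, x='x', y='y', agg=ds.reductions.mean('Z'))

# Curvilinear: 2D coordinate arrays attached as non-dimension coords
Qy, Qx = np.meshgrid(ys, xs, indexing='ij')
Qx = Qx + 0.3 * Qy  # shear the grid so the quads are no longer axis-aligned
curvi = xr.DataArray(zs,
                     coords={'Qx': (['y', 'x'], Qx), 'Qy': (['y', 'x'], Qy)},
                     dims=['y', 'x'], name='Z')
curvi_agg = canvas.quadmesh(curvi, x='Qx', y='Qy', agg=ds.reductions.mean('Z'))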

Future work:
This PR does not include any parallelization support, so extending this to work in a multi-threaded or distributed context is left as future work.

@jbednar @philippjfr


Outdated performance observations from initial PR:

But it's an order of magnitude slower than the implementations in #769. Here is a notebook showing some timing results: https://anaconda.org/jonmmease/rectquadmesh_examples/notebook.

Roughly speaking, this PR is ~13x faster than representing a rectilinear quadmesh with a trimesh, but the specialized implementation from #769 is ~13x faster than this PR. Note that I disabled numba parallelization for these tests for consistency.

I did some performance debugging and found that nearly all of the extra overhead in this PR, compared to the specialized implementation, comes from the use of the aggregation framework. If, in the _extend function, I don't call append but instead implement a single aggregation inline, the performance is comparable to the specialized implementations.

So the bad news is that right now we need to choose between performance and consistency/maintainability for the quadmesh implementation. The good news is that there may be an order of magnitude speedup to be had across points, line, and area glyphs as well if we can work out how to optimize the aggregation framework.
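To make the trade-off concrete, here is a rough sketch (hypothetical names, not datashader's actual internals) of the two paths being compared: the framework path that routes every pixel through a generic append, versus a single hand-inlined count aggregation:

from numba import njit

def make_extend(append):
    # Framework-style path (sketch): every covered pixel goes through a
    # closure-captured, compiled `append` that updates all requested aggregates.
    @njit
    def extend_generic(x0i, x1i, y0i, y1i, i, aggs_and_cols):
        for xi in range(x0i, x1i):
            for yi in range(y0i, y1i):
                append(i, xi, yi, aggs_and_cols)
    return extend_generic

# Hand-specialized path (sketch): one concrete reduction written inline.
@njit
def extend_count(x0i, x1i, y0i, y1i, count_agg):
    for xi in range(x0i, x1i):
        for yi in range(y0i, y1i):
            count_agg[yi, xi] += 1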

@jbednar
Member
commented Aug 13, 2019

That's both exciting and alarming! Can you expand on "disabled numba parallelization for these tests for consistency"? Given that the final implementation will use Numba, are comparisons without Numba meaningful here?

@jonmmease
Collaborator Author

Can you expand on "disabled numba parallelization for these tests for consistency"?

@philippjfr's implementation uses the numba parallel prange loops (https://numba.pydata.org/numba-doc/latest/user/parallel.html#explicit-parallel-loops), but the current datashader glyphs do not. I tried turning it on for this PR, but got a numba error that I didn't spend time diagnosing.
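For context, explicit parallel loops in Numba look roughly like this (a generic sketch, not code from either PR):

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def parallel_sum(values):
    # prange lets Numba split these iterations across threads; scalar
    # reductions like `acc += ...` are recognized and handled safely.
    acc = 0.0
    for i in prange(values.shape[0]):
        acc += values[i]
    return acc

parallel_sum(np.arange(1e6))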

@philippjfr
Member

I think unifying the dispatch/aggregation pipeline and internal APIs is really important and I'm really excited about this, but going from a 600-700x to a 13x speedup would indeed be pretty sad :(

@jbednar
Member
commented Aug 13, 2019

Well, it wouldn't be that drastic; presumably the Numba issue can be worked around, which would leave it off by a single factor of 13, right?

@jonmmease
Collaborator Author

I was only seeing a factor of 2-3x speedup with parallelization in @philippjfr 's code (on a machine with many more cores than that). But, yeah, we should figure out where we can use this for the glyph generation code.

@philippjfr
Member

I'm very confused why there is such a huge performance difference. Numba should automatically inline the append function. I'll play around with it and maybe ask Val about it when he returns next week.

@jonmmease
Collaborator Author

Yeah, I think it's largely about inlining, and I don't have good intuition for when things get inlined. The numba docs say that numba itself doesn't inline functions (https://numba.pydata.org/numba-doc/dev/user/faq.html#does-numba-inline-functions), but that LLVM does in some cases.

@jonmmease
Collaborator Author

Also, the append function is constructed from evaluating a string, so perhaps that's a factor.

@philippjfr
Member

Also, the append function is constructed from evaluating a string, so perhaps that's a factor.

I wouldn't have thought so since once it's JIT compiled there should be no difference.

@philippjfr
Member

As an aside, I can't get the implementation in this PR to work with an x_range or y_range provided to the Canvas.

@jonmmease
Collaborator Author

@philippjfr
Member

Yeah, never mind, that's not the actual issue. The problem seems to be the order the dimensions are parsed in and returned in. I'd expect the DataArray dimensions to be ordered y, x on input and output; something appears to be inverting that.

@philippjfr
Member
commented Aug 13, 2019

Don't really understand why this happens because this code is correct:

if x is None and y is None:
    y, x = source.dims

but in this example things are really weird:

import numpy as np
import holoviews as hv
import datashader as ds
from datashader import Canvas

xs = np.logspace(0, 3, 50)
ys = np.arange(50)

zs = xs * ys[:, np.newaxis]
dataset = hv.QuadMesh((xs, ys, zs), datatype=['xarray']).data
da = dataset.z

canvas = Canvas(x_range=(200, 1000), y_range=(0, 50))

print(da.dims)

agg1 = canvas.quadmesh(da, agg=ds.reductions.mean('z'))

print(agg1.dims)
print(agg1.coords)

agg2 = canvas.quadmesh(da, 'x', 'y', agg=ds.reductions.mean('z'))

print(agg2.dims)
print(agg2.coords)

which prints:
('y', 'x')
('x', 'y')
Coordinates:
  * x        (x) float64 0.04167 0.125 0.2083 0.2917 ... 49.71 49.79 49.88 49.96
  * y        (y) float64 200.7 202.0 203.3 204.7 ... 995.3 996.7 998.0 999.3
('y', 'x')
Coordinates:
  * y        (y) float64 0.04167 0.125 0.2083 0.2917 ... 49.71 49.79 49.88 49.96
  * x        (x) float64 200.7 202.0 203.3 204.7 ... 995.3 996.7 998.0 999.3

@philippjfr
Member

To summarize the issues:

  1. On the input the code inferring the dimension order appears wrong
  2. On the output the coord order matches the dim ordering, which is usually not desirable. The coordinate order is meaningful by NetCDF convention and should in the 2D case usually either be the inverse of the dim ordering OR maybe more appropriately simply inherit the ordering of the input DataArray.

@philippjfr
Member
commented Aug 13, 2019

Back to the inlining issue. I'm guessing the problem is that append is always two levels deep, i.e. it will itself call another function to do the various base aggregations. I think we may be able to get it to inline properly if, instead of creating one append function, we pass in a separate append function for each base aggregation, i.e. for a mean aggregation the inner loop would look something like:

for xi in range(x0i, x1i):
    for yi in range(y0i, y1i):
        append_count(j, i, xi, yi, count_agg)
        append_sum(j, i, xi, yi, sum_agg)

The actual implementation would be something like:

for xi in range(x0i, x1i):
    for yi in range(y0i, y1i):
        for append, array in zip(aggs, arrays):
            append(j, i, xi, yi, array)       

That would also get rid of the need for dynamically creating the append function from a string ( 😱) and instead use the _append from the base reductions (count, sum, min, max, m2, any) directly.

I'm pretty hopeful that will lead to proper inlining since it's pretty close to what my implementation was doing anyway.

Edit: I guess there are some complications in that we would also have to factor out the code that extracts the aggregated value, but I still think it's doable.
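A rough sketch of that idea (hypothetical names, not datashader's actual code), where each base reduction's append is a separately jitted function that the quad-filling loop closes over, giving Numba a chance to specialize and inline them:

from numba import njit

@njit
def append_count(j, i, xi, yi, count_agg):
    # number of contributing values for this pixel
    count_agg[yi, xi] += 1

@njit
def append_sum(j, i, xi, yi, sum_agg, values):
    # running sum of the quad's value for this pixel
    sum_agg[yi, xi] += values[j, i]

def make_extend_mean(append_count, append_sum):
    # Closing over the jitted per-reduction appends lets Numba specialize
    # (and typically inline) them when it compiles the pixel-filling loop.
    @njit
    def extend_quad(x0i, x1i, y0i, y1i, j, i, values, count_agg, sum_agg):
        for xi in range(x0i, x1i):
            for yi in range(y0i, y1i):
                append_count(j, i, xi, yi, count_agg)
                append_sum(j, i, xi, yi, sum_agg, values)
    return extend_quad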

@jonmmease
Collaborator Author

The dimension flip seems to happen when converting the DataArray to a Dataset...

print(list(da.dims))
print(list(da.to_dataset().dims))
['y', 'x']
['x', 'y']

So I'll just be sure to grab it from the DataArray itself.

@jonmmease
Collaborator Author

The dimension and coord ordering logic is carried over from the default behavior of the glyph aggregators. Here's what you get from points:

canvas = Canvas(x_range=(200, 1000), y_range=(0, 50))
canvas.points(pd.DataFrame({'x': [300], 'y': [20]}), 'x', 'y')
<xarray.DataArray (y: 600, x: 600)>
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)
Coordinates:
  * y        (y) float64 0.04167 0.125 0.2083 0.2917 ... 49.71 49.79 49.88 49.96
  * x        (x) float64 200.7 202.0 203.3 204.7 ... 995.3 996.7 998.0 999.3

Do you think we should flip the coord order for what Datashader returns across all of the glyph types? Or we could treat quadmesh individually because it's the only one (so far) where the input data structure is an xarray.

@philippjfr
Member

Actually, the other issue with the current append is that it does the array lookup inside the inner loop, which also seems pretty inefficient.

@philippjfr
Member

Or we could treat quadmesh individually because it's the only one (so far) where the input data structure is an xarray.

Yeah, I think we should have special logic for xarrays.

@jonmmease
Collaborator Author

Actually the other issue with the current append is that it does the array lookup inside the inner loop which also seems pretty inefficient.

I'm going to manually inline the mean aggregate case so we can see what it looks like and what the performance difference is. I'll paste the inlined version here when I have it.

@jbednar
Member
commented Aug 13, 2019

Isn't the raster input an xarray too? But that's a different code path right now, I suppose.

@jonmmease
Collaborator Author

I think I've worked out an approach to automating this variable-arity optimization that works across glyphs. I'll open a separate PR with that when it's ready...

Commit message: "The ordering of x then y makes it possible to pass the output xarray into a HoloViews Image container with the dimensions being transposed"
@jonmmease
Collaborator Author

For now, I updated the construction of the returned DataArray to specify a coordinate order of [x, y]. The previous coordinate order was the reason that passing an aggregate result into a HoloViews Image resulted in an inversion of the x and y axes (which was true for points and lines as well). Done in commit 87cf7e3.
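To illustrate the distinction (a minimal sketch, not the PR's code), the dims of the returned aggregate stay (y, x) while the coords are listed x-first:

import numpy as np
import xarray as xr
from collections import OrderedDict

nx, ny = 4, 3
xs = np.linspace(200, 1000, nx)
ys = np.linspace(0, 50, ny)
data = np.zeros((ny, nx))

# dims stay (y, x), matching the row-major layout of the aggregate buffer,
# while the coords are inserted x-first, which is the order that avoids the
# axis inversion described above when wrapping the result in a HoloViews Image.
agg = xr.DataArray(data,
                   coords=OrderedDict([('x', xs), ('y', ys)]),
                   dims=['y', 'x'])

print(agg.dims)          # ('y', 'x')
print(list(agg.coords))  # ['x', 'y']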

@jonmmease
Collaborator Author

@philippjfr I think I'm satisfied (in terms of correctness, API, and performance) with the rectilinear quadmesh implementation in this PR now. So feel free to kick the tires again and let me know if anything else sticks out.

I'm going to move on to the curvilinear case now.

@jbednar
Member
commented Aug 16, 2019

So, what's the final word on the performance?

@jonmmease
Collaborator Author

So, what's the final word on the performance?

Equivalent to the "append_expanded" line above (~40x speedup over trimesh for the rectilinear case). It turns out that all of the *_parallel implementations suffer from a race condition that would require something like numba/numba#3681 to address, so prange parallelization is not enabled at this point.
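The race is the usual unsynchronized read-modify-write on shared aggregate bins when two threads hit the same pixel; a stripped-down illustration (not the PR's code):

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def racy_count(rows, cols, agg):
    # Different prange iterations can hit the same (row, col) bin, and the
    # += below is a non-atomic read-modify-write, so increments can be lost.
    for k in prange(rows.shape[0]):
        agg[rows[k], cols[k]] += 1

agg = np.zeros((10, 10), dtype=np.int64)
rows = np.zeros(1_000_000, dtype=np.int64)  # every point lands in bin (0, 0)
cols = np.zeros(1_000_000, dtype=np.int64)
racy_count(rows, cols, agg)
print(agg[0, 0])  # frequently less than 1_000_000 when run with multiple threads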

jonmmease changed the title from "[WIP] Add quadmesh glyph" to "Add quadmesh glyph with rectilinear and curvilinear support" on Aug 16, 2019
@jonmmease
Collaborator Author

@jbednar @philippjfr I've added the curvilinear quadmesh support and updated the PR description to reflect the current contents of the PR. Ready for review!

jonmmease mentioned this pull request on Aug 21, 2019
@jbednar
Member
commented Aug 28, 2019

Merging now; hopefully we can recover the remaining 12X or so of performance using Dask to make the parallel operations safe, plus some other optimizations later.
