Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for row-oriented coordinates in Canvas.line #694

Merged
merged 28 commits into from
Jan 31, 2019
Merged

Conversation

jonmmease
Copy link
Collaborator

@jonmmease jonmmease commented Jan 22, 2019

Overview

This PR introduces a new Canvas.lines method for rendering collections of polylines (here called "lines" for short) where each polyline corresponds to a single row in an input DataFrame. This is in contrast to the current Canvas.line method which renders a single polyline (potentially containing gaps if there are nan values) per DataFrame.

I propose that this method eventually be overloaded to support the following 3 use-cases. Note: This PR only implements (1), but I'm hopeful that the addition of (2) and (3) will not require much additional refactoring.

  1. Each line has the same length, N, and is represented by N columns containing the x coordinates and N columns containing the y coordinates. Here's an example in the case of two-vertex line segments (N=2):
cvs.lines(df, x=['x0', 'x1'], y=['y0', 'y1'])
  1. Each line has the same length, N, but either the x or y vertex coordinates are constant for all lines. A common use-case here would be time-series data, where the x values (the time bins) are constant across all lines and each row of the DataFrame contains only the y-coordinates of each line. This would nicely handle cases like the one brought up in Plotting many lines as separate columns takes a long time #286 (comment), without the need to concatenate and nan separate each line. Here's what the aggregation step for this example could look like:
data = ... # 10000 x 50 array representing 10000 lines over 50 time bins
cvs.lines(pd.DataFrame(data), y=list(range(50)), x_constants=np.linspace(0, 10, 50))

Note that the data numpy array is converted into a DataFrame before being passed to cvs.lines. This DataFrame will have columns with integer labels that match the column index, and so the y=list(range(50)) argument is specifying a list of the column labels where the column labels are integers. As suggested in #283, it would also be nice to allow DataShader to accept a numpy (or Dask) array directly without requiring this DataFrame conversion step. I see this as a somewhat orthogonal feature that would require support across several other parts of the API.

  1. Each line may have a different length. In this case the x and y vertex coordinates could be stored in a pair of RaggedArray columns (Add pandas ExtensionArray for storing homogeneous ragged arrays #687).

Additional consideration: For cases 1 and 2, we should also have a way to specify whether lines are
represented by rows (as I originally had in mind writing the descriptions above) or by columns. Representing lines by rows is useful when the line lengths are manageable, but there are lots and lots of lines. Specifying lines by columns is useful when the number of lines is manageable, but each line has lots and lots of points.

Example

Drawing line segments

import numpy as np
import pandas as pd
import dask.dataframe as dd
import datashader as ds
import datashader.utils as du, datashader.transfer_functions as tf

n = 100
np.random.seed(2)

x0 = np.random.uniform(size=n)
x1 = np.random.uniform(size=n)
y0 = np.random.uniform(size=n)
y1 = np.random.uniform(size=n)

pts = np.stack((x0,x1,y0,y1)).T
df = pd.DataFrame(np.stack((x0,x1,y0,y1)).T, columns=['x0', 'x1', 'y0', 'y1'])

cvs = ds.Canvas(plot_height=400,plot_width=400, x_range=[0, 1], y_range=[0, 1])
tf.shade(cvs.lines(df, ['x0', 'x1'], ['y0', 'y1'], agg=ds.count()))

download

Performance

Edit: See updated performance results in #694 (comment)

Here are some performance comparisons comparing cvs.lines with the equivalent cvs.line command for the segment example above. Timing does not take into account the time to generate the nan separated DataFrame needed by cvs.line.

newplot 8

And here's what scaling looks like on my quad-core 2015 MBP.
newplot 9

So the current cvs.line approach is consistently a little bit faster in the single-threaded case. But if the conversion time were taken into account this difference would probably get washed out. And for some reason the new cvs.lines approach seems to scale a little better in the multi-threaded case (at least for 3 and 4 cores).

TODO

  • Evaluate unifying this functionality with the current cvs.line method by adding a new "lines-by-rows-or-columns" argument. Either way, sketch out the future API for the by-rows and by-columns cases.
  • Rewrite last section of http://datashader.org/user_guide/3_Timeseries.html

extend_line = _build_extend_line(draw_line, map_onto_pixel)

@ngjit
def extend_lines_xy(vt, bounds, xs, ys, plot_start, *aggs_and_cols):
Copy link
Member

@philippjfr philippjfr Jan 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a go at rewriting this by adapting the extend_line code and that seems somewhat more efficient than unpacking this format and reusing the existing codepath. Probably needs some more fixes, but here's what my 10 minutes of hacking produced:

    @ngjit
    def extend_lines_xy(vt, bounds, xs, ys, plot_start, *aggs_and_cols):
        """
        here xs and ys are tuples of arrays and non-empty
        """
        xmin, xmax, ymin, ymax = bounds
        nrows = xs[0].shape[0]
        ncols = len(xs)
        i = 0
        while i < nrows - 1:
            plot_start = True
            j = 0
            while j < ncols - 1:
                x0 = xs[j][i]
                y0 = ys[j][i]
                x1 = xs[j+1][i]
                y1 = ys[j+1][i]

                # Use Liang-Barsky (1992) to clip the segment to a bounding box
                if outside_bounds(x0, y0, x1, y1, xmin, xmax, ymin, ymax):
                    plot_start = True
                    j += 1
                    continue

                t0, t1 = 0, 1
                dx = x1 - x0

                t0, t1, accept = clipt(-dx, x0 - xmin, t0, t1)
                if not accept:
                    j += 1
                    continue

                t0, t1, accept = clipt(dx, xmax - x0, t0, t1)
                if not accept:
                    j += 1
                    continue

                dy = y1 - y0

                t0, t1, accept = clipt(-dy, y0 - ymin, t0, t1)
                if not accept:
                    j += 1
                    continue

                t0, t1, accept = clipt(dy, ymax - y0, t0, t1)
                if not accept:
                    j += 1
                    continue

                if t1 < 1:
                    clipped = True
                    x1 = x0 + t1 * dx
                    y1 = y0 + t1 * dy

                if t0 > 0:
                    # If x0 is clipped, we need to plot the new start
                    clipped = True
                    plot_start = True
                    x0 = x0 + t0 * dx
                    y0 = y0 + t0 * dy

                x0i, y0i = map_onto_pixel(vt, bounds, x0, y0)
                x1i, y1i = map_onto_pixel(vt, bounds, x1, y1)
                draw_line(x0i, y0i, x1i, y1i, i, plot_start, clipped, *aggs_and_cols)
                plot_start = False
                j += 1
            i += 1

    return extend_lines_xy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! @jonmmease, can you benchmark that like you did above?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this on a 10M vertex trimesh wireframe using dask this renders in 0.5 seconds as opposed to 1.5 seconds with cvs.line.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for giving this a try @philippjfr! I'll incorporate this approach and add it to my benchmarks

Copy link
Member

@jbednar jbednar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I think this will be very valuable to users, and I agree with your proposed directions 2 and 3. Very nice!

Can you please also update the user guide notebook http://datashader.org/user_guide/3_Timeseries.html to suggest using this data structure when your polylines are all of the same length, instead of the approaches proposed there now?

datashader/glyphs.py Outdated Show resolved Hide resolved
datashader/glyphs.py Outdated Show resolved Hide resolved
@jonmmease
Copy link
Collaborator Author

@jbednar for documentation I was thinking of adding a section to http://datashader.org/user_guide/4_Trajectories.html for the case where you have a collection of fixed length trajectories (just make up some more random walks).

Until we have (2) from the description above I think the last section of http://datashader.org/user_guide/3_Timeseries.html would still be a bit awkward because the user would need to create columns for the x-data that would repeat for every row. With (2) it will be a one-liner without any utility functions or work-arounds needed.

What do you think?

@jbednar
Copy link
Member

jbednar commented Jan 22, 2019

Ok, sounds good. In that case please briefly mention the new approach in 3_Timeseries, for now, and once approach 2 is implemented that section can be rewritten to use the new way.

BTW, can approach 2 be approximated already in the current PR's code simply by passing the y column n numbers of times as the y_label, for n x columns? It might not be quite as fast as having code that knows the y column is shared, but wouldn't require the user to duplicate the actual column, right?

@jonmmease
Copy link
Collaborator Author

BTW, can approach 2 be approximated already in the current codebase simply by passing the y column n numbers of times as the y_label, for n x columns? It might not be quite as fast, but wouldn't require duplicating the actual column, right?

I'm not sure, I'll give it another look when I incorporate @philippjfr 's suggestion above. And if something looks relatively straightforward, I'll just go ahead and add it.

@jbednar
Copy link
Member

jbednar commented Jan 22, 2019

Note that I'm just suggesting a (untested) way for the user to use the current code, not a change to the code...

@jonmmease
Copy link
Collaborator Author

Updated performance results after implementing optimization suggested by @philippjfr in bc7631c:

newplot 15

newplot 14

Performance is now basically equivalent for single threaded-case and is much faster for multi-threaded case.

@jonmmease
Copy link
Collaborator Author

I added a new API consideration to the main description:

Additional consideration: For cases 1 and 2, we should also have a way to specify whether lines are
represented by rows (as I originally had in mind writing the descriptions above) or by columns. Representing lines by rows is useful when the line lengths are manageable, but there are lots and lots of lines. Specifying lines by columns is useful when the number of lines is manageable, but each line has lots and lots of points.

I need to think about it a little more, but if we have an axis style argument that determines whether lines are specified by rows or columns, then we might be able to unify this functionality with the current cvs.line method, rather than introduce a new cvs.lines method.

@jonmmease
Copy link
Collaborator Author

For the sake of discussion, here is a potential API that would allow this new functionality to be integrated into the existing cvs.line API. The basic idea is to add a pandas-style axis argument that determines whether lines are being aggregated per column (axis=0) or per row (axis=1).

New method signature:

def line(self, source, x=None, y=None, agg=None, axis=0, x_constants=None, y_constants=None):
    ...

Example source is a DataFrame with N rows

df = pd.DataFrame({
    'A1': [...],
    'A2': [...],
    'B1': [...],
    'B2': [...],
    'A_ragged': RaggedArray(...),
    'B_ragged': RaggedArray(...)
})

One line across all rows

Here is an example of the current behavior of aggregating one line per DataFrame. The following example will produce a single length N line with coordinates df.A1 by df.B1

cvs.line(df, x='A1', y='B1')

or with axis specified explicitly

cvs.line(df, x='A1', y='B1', axis=0)

Multiple lines across all rows

An extension of the current behavior would be to allow x and y to be set to lists of columns in order to support aggregating multiple lines. The following example would produce two length N lines. The first with coordinates df.A1 by df.B1 and the second with coordinates df.A2 by df.B2.

cvs.line(df, x=['A1', 'A2'], y=['B1', 'B2'], axis=0)

Multiple lines across all rows with one shared dimension

If all of the lines share the same x coordinates, then this column could be specified as a string rather
than a list. The following example would produce two length N lines. The first with coordinates df.A1 by df.B1 and the second with coordinates df.A1 by df.B2.

cvs.line(df, x='A1', y=['B1', 'B2'], axis=0)

Many fixed length lines

The following example would aggregate one length-2 line per row, where the ith line has coordinates
[df.A1[i], df.A2[i]] by [df.B1[i], df.B2[i]]

cvs.line(df, x=['A1', 'A2'], y=['B1', 'B2'], axis=1)

Many fixed length lines with one shared dimension

The following example would aggregate one length-2 line per row, where the ith line has coordinates
[0.0, 1.0] by [df.B1[i], df.B2[i]]

cvs.line(df, x_constants=[0.0, 1.0], y=['B1', 'B2'], axis=1)

Many variable length lines

The following example would aggregate one line per row, where the ith line has coordinates df.A_ragged[i] by df.B_ragged[i].

cvs.line(df, x='A_ragged', y='B_ragged', axis=1)

Invalid API usage

In my mind the biggest drawback to this API is that some of the combinations of parameters will be invalid and it may be confusing for users to understand why.

For example, x_constants/y_constants couldn't be used with axis=0 because in this case we don't know the line length in advance.

Also, specifying RaggedArray columns would only make sense in the axis=1 case.

But, I expect that a good docstring and helpful error messages should be enough to help people find the options that they want.

What do you think?

@jbednar
Copy link
Member

jbednar commented Jan 25, 2019

It looks very flexible and powerful to me!

I wonder if there is some way to avoid having to introduce x_constants and y_constants as different names. I see it's not practical to deduce that [0.0, 1.0] is a constant coordinate array rather Pandas column names, since Pandas allows numeric column names. Could use tuples instead of lists, but that seems arbitrary. Maybe if it's an array, treat it as a constant axis to broadcast, but otherwise as column names?

@jonmmease
Copy link
Collaborator Author

I wonder if there is some way to avoid having to introduce x_constants and y_constants as different names

Yeah, I was hoping to avoid the need for x_constants/y_constants as well. But I've struggled to come up with a natural way to distinguish column labels from numeric vectors as well. Using a numeric numpy array for this might work out alright, but then I'm not sure what to do with a numpy array of strings or a numeric pandas Series. Do you have any thoughts @philippjfr ?

@jbednar
Copy link
Member

jbednar commented Jan 25, 2019

I guess to refine my proposal then, an array or Pandas series type would be treated as a constant numeric vector, while if one has a numpy array of strings, you'd have to convert it to a list of strings for it to be treated as a list of column names.

@jonmmease
Copy link
Collaborator Author

I guess to refine my proposal then, an array or Pandas series type would be treated as a constant numeric vector, while if one has a numpy array of strings, you'd have to convert it to a list of strings for it to be treated as a list of column names.

Works for me!

@jonmmease
Copy link
Collaborator Author

Ok, I added the axis argument to Canvas.line and removed Canvas.lines. So now the following two usages are supported.

One line across all rows

cvs.line(df, x='A1', y='B1')
cvs.line(df, x='A1', y='B1', axis=0)

Many fixed length lines

cvs.line(df, x=['A1', 'A2'], y=['B1', 'B2'], axis=1)

The other usages described above can be added in a follow-on PR.

@jonmmease
Copy link
Collaborator Author

jonmmease commented Jan 28, 2019

Alright, I went ahead and implemented all of the aggregation approaches in the proposed API except for RaggedArray.

Examples from the new Canvas.line docstring

Define a canvas and a pandas DataFrame with 6 rows

import pandas as pd
import numpy as np
from datashader import Canvas
import datashader.transfer_functions as tf
cvs = Canvas(plot_height=400, plot_width=400)
df = pd.DataFrame({
   'A1': [1, 1.5, 2, 2.5, 3, 4],
   'A2': [1.5, 2, 3, 3.2, 4, 5],
   'B1': [10, 12, 11, 14, 13, 15],
   'B2': [11, 9, 10, 7, 8, 12],
}, dtype='float64')
df

screen shot 2019-01-29 at 10 02 30 am

Aggregate one line across all rows, with coordinates df.A1 by df.B1

agg = cvs.line(df, x='A1' , y='B1')
tf.shade(agg)

download 1

Aggregate two lines across all rows. The first with coordinates df.A1 by df.B1 and the second with coordinates df.A2 by df.B2

agg = cvs.line(df, x=['A1', 'A2'], y=['B1', 'B2'], axis=0)
tf.shade(agg)

download 2

Aggregate two lines across all rows where the lines share the same x coordinates. The first line will have coordinates df.A1 by df.B1 and the second will have coordinates df.A1 by df.B2.

agg = cvs.line(df, x='A1', y=['B1', 'B2'], axis=0)
tf.shade(agg)

download 3

Aggregate 6 length-2 lines, one per row, where the ith line has coordinates [df.A1[i], df.A2[i]] by [df.B1[i], df.B2[i]]

agg = cvs.line(df, x=['A1', 'A2'], y=['B1', 'B2'], axis=1)
tf.shade(agg)

download 4

Aggregate 6 length-4 lines, one per row, where the x coordinates of every line are [0, 1, 2, 3] and the y coordinates of the ith line are [df.A1[i], df.A2[i], df.B1[i], df.B2[i]].

agg = cvs.line(df, x=np.arange(4), y=['A1', 'A2', 'B1', 'B2'], axis=1)
tf.shade(agg)

download 5

Documentation

I haven't updated the documentation notebooks yet, but with these changes I'll be able to rewrite the last section of http://datashader.org/user_guide/3_Timeseries.html and remove the use of the dataframe_from_multiple_sequences utility function all together.

Performance

Updated performance results for aggregation time of length-2 line segments

newplot 2

newplot 3

newplot 4

@jbednar
Copy link
Member

jbednar commented Jan 28, 2019

Looks great!!! Thanks for rounding that out. Seems like a very reasonable API now.

@jonmmease
Copy link
Collaborator Author

I added the Dask tests, and rewrote the last section of the Timeseries documentation to use the new approach.

Ready for review @jbednar

Copy link
Member

@jbednar jbednar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fabulous! This work really helps make all the various scenarios that come up in practice be handled efficiently and conveniently. Great work. I did look at every line of the code involved, but I have to admit that my eyes glaze over a bit in some of the deep line-drawing bits, so I can't honestly say that I'd have detected any problems there if any exist. :-) But from what I could focus on, it looks good, and is ready to merge when tests pass.

return minval, maxval

@staticmethod
def maybe_expand_bounds(bounds):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useful thing to refactor, thanks.

@jbednar
Copy link
Member

jbednar commented Jan 30, 2019

I modified the last example in the user-guide notebook so that it was a random walk with a fixed starting point:

image

This now seems like a very reasonable synthesized approximation to something people will use in practice, i.e. overlaying huge numbers of lines with the same starting point.

@jonmmease
Copy link
Collaborator Author

Hmm, hitting an AppVeyor permission denied error on Python 2.7. Is this the same one we've been fighting with?

screen shot 2019-01-30 at 7 01 58 pm

@jbednar jbednar changed the title Add Canvas.lines for rendering a collection of lines (one per row) Support for row-oriented coordinates in Canvas.line Jan 31, 2019
@jonmmease
Copy link
Collaborator Author

Merging so I can go ahead and rebase #696. Thanks for the review and doc updates @jbednar!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants