Plotting many lines as separate columns takes a long time #286

Closed
StevenCHowell opened this issue Feb 19, 2017 · 16 comments
Comments

@StevenCHowell

I am hoping this is a misunderstanding or misuse on my part, but plotting lots of lines with Datashader takes much longer than I would expect, at least compared to plotting points. To provide an example case, I wrote the code below, which demonstrates what I am seeing. As a summary, this figure shows how many seconds it takes to plot n lines, each with 50 data points.
[figure: ds_line_test]
While this shows the time scales as n log(n), which is better than n^2, the sheer amount of time (over 4.5 hours for 8192 lines) makes me wonder whether it is actually running optimized Numba code. Again, I hope this is simply misuse on my part, as I want to use this to plot over 100,000 lines. I got similar results on a Linux desktop (12 cores, 32 GB RAM) and on my Mac laptop. I ran this in a Jupyter notebook; would it perform better in a Python script?

Here is the code I am running:

import pandas as pd
import numpy as np
import datashader
import bokeh.plotting
import collections
import xarray
import time
from bokeh.palettes import Colorblind7 as palette

bokeh.plotting.output_notebook()

# create some data worth plotting
x = np.linspace(0, np.pi * 2)
y = np.sin(x)
n = 100000
data = np.empty([n+1, len(y)])
data[0] = x
prng = np.random.RandomState(123)
offset = prng.normal(0, 0.1, n).reshape(n, -1)
data[1:] = y
data[1:] += offset
df = pd.DataFrame(data.T)
x_range = 0, 2*np.pi
y_range = -1.5, 1.5
y_cols = range(1, n+1)

# iterate over increasing number of lines
run_times = []
imgs = []
test = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
for n in test:
    this_df = df[:n+1]
    y_cols = range(1, n+1)
    tic = time.time()
    canvas = datashader.Canvas(x_range=x_range, y_range=y_range, 
                               plot_height=300, plot_width=300)
    aggs = collections.OrderedDict((c, canvas.line(df, 0, c)) for c in y_cols)
    merged = xarray.concat(aggs.values(), dim=pd.Index(y_cols, name='cols'))
    img = datashader.transfer_functions.shade(merged.sum(dim='cols'), how='eq_hist')
    toc = time.time() - tic
    run_times.append(toc)
    imgs.append(img)

# plot the result
p = bokeh.plotting.figure(y_axis_label='time (s)', x_axis_label='n lines',
                          width=400, height=300, x_axis_type='log',
                          y_axis_type='log',
                          title='Run-time for Datashader line plot')
p.circle(test, run_times, legend='run times', color=palette[0])

# what is the slope of the rate of increase?
test = np.array(test)
n2 = test ** 2 # + run_times[0]
n3 = test ** 3 # + run_times[0]
nlogn = test * np.log(test) # + run_times[0]
p.line(test, n2, legend='n^2', color=palette[1])
p.line(test, n3, legend='n^3', color=palette[2])
p.line(test, nlogn, legend='n log(n)', color=palette[3])

p.legend.location = 'top_left'
bokeh.plotting.show(p)

Here is the last image (using 8192 of the total 100000 lines):

[figure: ds_8192_lines]

@StevenCHowell (Author)

Out of curiosity, I wanted to see how fast Bokeh could do these same plots (they will not look good, but I was interested in comparing the timings). Bokeh was much faster, but it looks like its run time grows at the n^2 rate instead of the n log(n) rate, so it will soon cross over.

[figure: ds_bokeh_line_test]

bokeh_run_times = []
plots = []
for n in test:
    tic = time.time()
    p = bokeh.plotting.figure()
    for i in range(1, n+1):
        p.line(data[0], data[i])
    toc = time.time() - tic
    bokeh_run_times.append(toc)
    plots.append(p)

@jbednar (Member) commented Feb 21, 2017

Running in a notebook should have the same speed as a bare script, for something like this. Here it seems like you are aggregating each line separately, then summing over all of them? I haven't tried that approach with large numbers of lines, and I'm not sure what its computational complexity would be. What we usually do is to have all such lines in a single data frame, separated by all-NaN rows, in which case it's approximately linear in the total number of points (assuming the total number of points is much greater than the number of lines).
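
For concreteness, here is a minimal sketch of that layout, with made-up values (the column names 'x' and 'y' are arbitrary):

import numpy as np
import pandas as pd

# Two short lines stored in one dataframe, separated by an all-NaN row.
df = pd.DataFrame({
    'x': [0.0, 1.0, 2.0, np.nan, 0.0, 1.0, 2.0],
    'y': [1.0, 2.0, 1.5, np.nan, 0.5, 0.7, 0.9],
})

# canvas.line(df, 'x', 'y') then draws the two curves without connecting
# the last point of the first curve to the first point of the second.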

@StevenCHowell (Author) commented Feb 21, 2017

@jbednar can you provide an example of what you are describing, perhaps using the example sine data I have above? My implementation was patterned after the time series notebook you have on Anaconda Cloud. Here is the core bit of code I am timing:

    # (excerpt from the timeseries notebook: ds = datashader, xr = xarray,
    #  tf = datashader.transfer_functions)
    cvs = ds.Canvas(x_range=x_range, y_range=y_range,
                    plot_height=h, plot_width=w)
    aggs = OrderedDict((c, cvs.line(df, 'Time', c)) for c in cols)
    merged = xr.concat(aggs.values(), dim=pd.Index(cols, name='cols'))
    total = merged.sum(dim='cols')
    img = tf.shade(total, how='linear')

That was the best example I found of plotting many lines.

@jbednar (Member) commented Feb 21, 2017

See the OpenSky example, which has about 200,000 separate trajectories. The timeseries notebook focuses on a few curves with very many samples each, so it's maybe not the best starting point for this case.

@StevenCHowell (Author)

In the OpenSky example, it is not clear how a single flight trajectory is stored in the flightpaths DataFrame. I see there are longitude and latitude coordinates, but how are the different flights separated? Using the ascending column?

Are these simply straight lines between two points? The city specific plots make it look like that is not the case.

I am not sure how I would reformat many lines from a 2 column text file into a comparable DataFrame. I have tens of thousands of data files. As they all have the same x-points, I currently load all the y-values into a NumPy array. I could just as easily put each set of y-values into a column in a DataFrame, but that is nothing like what the OpenSky DataFrame looks like.

It almost looks like it was drawn by connecting subsequent longitude-latitude points, but this would draw lines between where one plane lands and the next takes off (unless it is all the same plane, which does not seem likely). Mimicking this with thousands of sine curves, I would end up with lines from the last point of one curve to the first point of the next curve. How do you avoid this?

@jbednar (Member) commented Feb 22, 2017

Plotting line segments, time series, and trajectories all use the same underlying Line glyph support. Fundamentally, datashader accepts a dataframe where each row represents one point along a multiline, and it will then connect that point to the point provided in the subsequent row.

You are correct that doing so would falsely connect between unrelated lines, but as I briefly referred to above, whenever a row has a NaN in the X or Y (or usually both), this separator is treated as a break between lines, and the preceding and subsequent points are not connected.

So, if you have 5 curves of 10 points each, then the dataframe would store the 10 (x,y) pairs from the first curve, then (NaN,NaN), then the 10 (x,y) pairs from the second curve, and so on, ending up with 54 or 55 rows in the dataframe, not 50.

Creating such a data structure from numpy arrays shouldn't be very difficult:

x = np.fromfile("xfile")       # shared x values
dfs = []

for f in files:                # one file per curve
    y = np.fromfile(f)
    df = pd.DataFrame({"x": x, "y": y})
    dfs += [df, pd.DataFrame({"x": [np.nan]})]   # NaN row separates curves

df = pd.concat(dfs, ignore_index=True)

@jbednar (Member) commented Feb 22, 2017

(But note that the released version of datashader has a bug in the line support, skipping over the last or first point (can't remember which), so you should use the github master until we can do a new release.)

@StevenCHowell (Author) commented Feb 23, 2017

This is MUCH faster! Here are the timing comparisons.

[figure: ds_lines]

This prompts me to ask: why not use this method of rearranging the data for the time-series data? For any number of data sets, reformatting into two columns is orders of magnitude faster (~4000x faster for 8 data sets, close to the 10 used in the time-series data).

I also wonder what it would take to apply the faster two-column approach to data stored as many separate columns.

Here is the code to reformat the above data and apply the OpenSky method (with the timing bits).

dfs = []
split = pd.DataFrame({'x': [np.nan]})
for i in range(len(data)-1):
    x = data[0]
    y = data[i+1]
    df = pd.DataFrame({'x': x, 'y': y})
    dfs.append(df)
    dfs.append(split)
    
df = pd.concat(dfs, ignore_index=True)   

nx = len(x) + 1   # points per curve, plus the NaN separator row
run_times2 = []
imgs2 = []
test = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
for n in test:
    tic = time.time()
    canvas = datashader.Canvas(x_range=x_range, y_range=y_range,
                               plot_height=300, plot_width=300)
    # use only the rows belonging to the first n curves
    agg = canvas.line(df.iloc[:n*nx], 'x', 'y', datashader.count())
    img = datashader.transfer_functions.shade(agg, how='eq_hist')
    toc = time.time() - tic
    run_times2.append(toc)
    imgs2.append(img)

@StevenCHowell (Author) commented Feb 23, 2017

This provides a complete example (without the timing bits). Perhaps this could go in the gallery examples (#288):

import pandas as pd
import numpy as np
import datashader

# create some data worth plotting
nx = 50
x = np.linspace(0, np.pi * 2, nx)
y = np.sin(x)
n = 10000
data = np.empty([n+1, len(y)])
data[0] = x
prng = np.random.RandomState(123)

# offset each curve vertically by a random amount
offset = prng.normal(0, 0.1, n).reshape(n, -1)
data[1:] = y
data[1:] += offset

# make a few of the curves noisier
n_noisy = prng.randint(0, n, 5)
for i in n_noisy:
    data[i+1] += prng.normal(0, 0.5, nx)

# reformat the data into an appropriate DataFrame
dfs = []
split = pd.DataFrame({'x': [np.nan]})
for i in range(len(data)-1):
    x = data[0]
    y = data[i+1]
    df = pd.DataFrame({'x': x, 'y': y})
    dfs.append(df)
    dfs.append(split)
df = pd.concat(dfs, ignore_index=True)   

x_range = data[0].min(), data[0].max()
y_range = data[1:].min(), data[1:].max()

# actually make the plot
canvas = datashader.Canvas(x_range=x_range, y_range=y_range, 
                           plot_height=300, plot_width=300)
agg = canvas.line(df, 'x', 'y', datashader.count())
img = datashader.transfer_functions.shade(agg, how='eq_hist')
img

[figure: example]

@StevenCHowell (Author) commented Feb 23, 2017

While it remains awkward to plot NumPy arrays, I think the second method resolves the speed issue. In connection with issue #283, it would be nice to make the input method for plotting lines from NumPy arrays more intuitive, which would remove the awkward DataFrame generation. For now, the two-column DataFrame is a functional option.

@jbednar changed the title from "Plotting many lines takes a long time" to "Plotting many lines as separate columns takes a long time" on Feb 23, 2017
@narendramukherjee (Contributor) commented Oct 5, 2017

Thanks @StevenCHowell for this helpful discussion. Also thanks to @jbednar for suggesting that we use datashader in response to our paper at SciPy 2017 (http://conference.scipy.org/proceedings/scipy2017/narendra_mukherjee.html). I was wondering what the current state of affairs is for plotting multiple sequences stored in a numpy array with datashader. Although the workaround described above works very well on the memory side of things for me (it reduces memory use by about a factor of 4 compared to plainly overlaying the sequences on top of each other with matplotlib), creating the pandas dataframe takes a significant amount of time compared to everything else. Here I am plotting ~65000 time series, each with 45 time points; here are some rough timing numbers:

In [114]: data.shape
Out[114]: (65539, 45)

In [115]: %timeit %paste
dfs = []
split = pd.DataFrame({'x': [np.nan]})
for i in range(len(data)-1):
    x = data[0]
    y = data[i+1]
    df = pd.DataFrame({'x': x, 'y': y})
    dfs.append(df)
    dfs.append(split)

df = pd.concat(dfs, ignore_index=True)

## -- End pasted text --
1 loop, best of 3: 53.4 s per loop

In [116]: %timeit canvas = datashader.Canvas(x_range=(0, 45), y_range=(df['y'].min() - 10, df['y'].max() + 10), plot_height=600, plot_width=800)
10 loops, best of 3: 37.4 ms per loop

In [117]: %timeit agg = canvas.line(df, 'x', 'y', datashader.count())
1 loop, best of 3: 3.03 s per loop

Creating the pandas dataframe is by far the slowest step of the process; it takes almost a minute. While that is not too significant for plotting just one set of 65000 curves, I usually have to plot 40-50 of these sets in one go, which obviously takes a lot of time :|

@jbednar (Member) commented Oct 5, 2017

Hmm. We too are a bit frustrated here, but I don't know the best way forward. The NaN-separated format is fast to plot, but it is very expensive to create, and not very usable once it's created. E.g. there's no obvious way to index into it to select or filter individual polylines.

We could support directly iterating over a multidimensional numpy or xarray array of polylines, but unlike the NaN-separated format, that approach would be limited to equal-length lines, which is only one of many common cases. For instance, when rendering network graphs, we want to plot a very large number of short polylines of variable lengths.

It would be great if there were a very efficient ragged array format that fit all of these requirements, i.e. something indexable, quickly iterable in Numba code, and memory efficient. I'm pretty sure there isn't one readily available in Python, but as fixed-length arrays are an important special case for graph rendering (directly connected nodes), maybe we should just bite the bullet and support the special case of large NumPy/xarray arrays of fixed-length polylines?
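
(For concreteness, one common ragged-array layout, shown here purely as an illustration and not as something datashader currently consumes, is a flat buffer of points plus an offsets array marking where each polyline starts:)

import numpy as np

# All points concatenated; line i spans offsets[i]:offsets[i+1].
xs = np.array([0., 1., 2.,   0., 1.,   0., 1., 2., 3.])
ys = np.array([0., 1., 0.,   2., 2.,   1., 0., 1., 0.])
offsets = np.array([0, 3, 5, 9])

for i in range(len(offsets) - 1):
    sl = slice(offsets[i], offsets[i + 1])
    print("line %d:" % i, list(zip(xs[sl], ys[sl])))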

@philippjfr, what do you think?

@philippjfr (Member)

Since it doesn't seem like we are going to solve the ragged-array issue any time soon, I'd be in favor of adding some handling to support fixed-length arrays in the short term. What I'm not clear on is which format we'd want to support: the most obvious options are a dataframe with an index for the x-values and N columns for the y-values, or simply an array where the first column holds the x-values and the remaining columns the y-values. The dataframe format seems to make more sense, since that's what we support already.

@narendramukherjee (Contributor) commented Oct 6, 2017

Sorry for the confusion @jbednar and @philippjfr - I think the dataframe appending step in the solution posted above was the slow step, and I just copied it lazily. Here's a much faster implementation that avoids the appending step with some simple Numpy operations:

In [43]: data.shape
Out[43]: (65539, 45)

In [44]: new_waveforms = np.zeros((data.shape[0], data.shape[1] + 1))

In [45]: new_waveforms[:, -1] = np.nan

In [47]: new_waveforms[:, :-1] = data

In [48]: new_waveforms.shape
Out[48]: (65539, 46)

In [49]: x = np.zeros(46)

In [50]: x[-1] = np.nan

In [51]: x[:-1] = np.arange(45)

In [52]: x = np.tile(x, new_waveforms.shape[0])

In [53]: x.shape
Out[53]: (3014794,)

In [54]: %timeit df = pd.DataFrame({'x': x, 'y': new_waveforms.flatten()})
10 loops, best of 3: 21.5 ms per loop

That's a simple fix to the problem, at least for equal-length sequences: as long as the appending part is not done in pandas, the speed issue shouldn't be a problem. Maybe we could put this up as a cleaner example of how to deal with a large number of sequences?
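
For what it's worth, a generalized version of this snippet might look something like the following (just a sketch; the helper name to_nan_separated_df is made up and is not an existing datashader function):

import numpy as np
import pandas as pd

def to_nan_separated_df(ys, x=None):
    # ys: (n_lines, n_points) array, one equal-length curve per row.
    # x: shared x values (defaults to 0..n_points-1).
    n_lines, n_points = ys.shape
    if x is None:
        x = np.arange(n_points)
    # append a NaN to every curve so consecutive curves stay disconnected
    ys_padded = np.empty((n_lines, n_points + 1))
    ys_padded[:, :-1] = ys
    ys_padded[:, -1] = np.nan
    xs_padded = np.empty(n_points + 1)
    xs_padded[:-1] = x
    xs_padded[-1] = np.nan
    return pd.DataFrame({'x': np.tile(xs_padded, n_lines),
                         'y': ys_padded.ravel()})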

@jbednar (Member) commented Oct 6, 2017

We can certainly provide utility functions for converting common data structures that are used in practice into ones that datashader can deal with directly. I'd be happy to see the above generalized and put into a function in ds.utils, along with an example somewhere in the notebooks of how to use it (e.g. in tseries.ipynb). Any hope of you creating a PR for that? :-)

With that in place we can then determine if there's a big savings to be had by supporting a fixed-length array type directly.

@narendramukherjee (Contributor)

Cool, I will have a look at ds.utils and tseries.ipynb and put in a PR soon :)
