Plotting many lines as separate columns takes a long time #286

Closed
StevenCHowell opened this issue Feb 19, 2017 · 16 comments
Comments

@StevenCHowell

I am hoping this is a misunderstanding or misuse on my part, but plotting lots of lines with Datashader takes much longer than I would expect, at least compared to plotting points. To provide an example case, I wrote the code below, which demonstrates what I am seeing. As a summary, this figure shows how many seconds it takes to plot n lines, each with 50 data points.
[figure: ds_line_test]
While this shows the time scales as n log(n), which is better than n^2, the sheer amount of time (over 4.5 hours for 8192 lines) makes me wonder whether it is actually running optimized Numba code. Again, I hope this is simply misuse on my part, as I want to use this to plot over 100,000 lines. I got similar results on a Linux desktop (12 cores, 32 GB RAM) and on my Mac laptop. I ran this in a Jupyter notebook; would it perform better in a Python script?

Here is the code I am running:

import pandas as pd
import numpy as np
import datashader
import bokeh.plotting
import collections
import xarray
import time
from bokeh.palettes import Colorblind7 as palette

bokeh.plotting.output_notebook()

# create some data worth plotting
x = np.linspace(0, np.pi * 2)
y = np.sin(x)
n = 100000
data = np.empty([n+1, len(y)])
data[0] = x
prng = np.random.RandomState(123)
offset = prng.normal(0, 0.1, n).reshape(n, -1)
data[1:] = y
data[1:] += offset
df = pd.DataFrame(data.T)
x_range = 0, 2*np.pi
y_range = -1.5, 1.5
y_cols = range(1, n+1)

# iterate over increasing number of lines
run_times = []
imgs = []
test = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
for n in test:
    this_df = df[:n+1]
    y_cols = range(1, n+1)
    tic = time.time()
    canvas = datashader.Canvas(x_range=x_range, y_range=y_range, 
                               plot_height=300, plot_width=300)
    aggs = collections.OrderedDict((c, canvas.line(df, 0, c)) for c in y_cols)
    merged = xarray.concat(aggs.values(), dim=pd.Index(y_cols, name='cols'))
    img = datashader.transfer_functions.shade(merged.sum(dim='cols'), how='eq_hist')
    toc = time.time() - tic
    run_times.append(toc)
    imgs.append(img)

# plot the result
p = bokeh.plotting.figure(y_axis_label='time (s)', x_axis_label='n lines',
                          width=400, height=300, x_axis_type='log',
                          y_axis_type='log',
                          title='Run-time for Datashader line plot')
p.circle(test, run_times, legend='run times', color=palette[0])

# what is the slope of the rate of increase?
test = np.array(test)
n2 = test ** 2 # + run_times[0]
n3 = test ** 3 # + run_times[0]
nlogn = test * np.log(test) # + run_times[0]
p.line(test, n2, legend='n^2', color=palette[1])
p.line(test, n3, legend='n^3', color=palette[2])
p.line(test, nlogn, legend='n log(n)', color=palette[3])

p.legend.location = 'top_left'
bokeh.plotting.show(p)

Here is the last image (using 8192 of the total 100000 lines):

[figure: ds_8192_lines]

@StevenCHowell (Author)

Out of curiosity, I wanted to see how fast Bokeh could do these same plots (they will not look good, but I was interested in comparing the timings). Bokeh was much faster, but it looks like its run time grows at the n^2 rate instead of the n log(n) rate, so it will soon cross over.

[figure: ds_bokeh_line_test]

bokeh_run_times = []
plots = []
for n in test:
    tic = time.time()
    p = bokeh.plotting.figure()
    for i in range(1, n+1):
        p.line(data[0], data[i])
    toc = time.time() - tic
    bokeh_run_times.append(toc)
    plots.append(p)

@jbednar (Member) commented Feb 21, 2017

Running in a notebook should have the same speed as a bare script, for something like this. Here it seems like you are aggregating each line separately, then summing over all of them? I haven't tried that approach with large numbers of lines, and I'm not sure what its computational complexity would be. What we usually do is to have all such lines in a single data frame, separated by all-NaN rows, in which case it's approximately linear in the total number of points (assuming the total number of points is much greater than the number of lines).
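
For concreteness, here is a minimal sketch of that layout, with made-up values (the column names 'x' and 'y' are arbitrary):

import numpy as np
import pandas as pd

# Two short lines stored in one dataframe, separated by an all-NaN row.
df = pd.DataFrame({
    'x': [0.0, 1.0, 2.0, np.nan, 0.0, 1.0, 2.0],
    'y': [1.0, 2.0, 1.5, np.nan, 0.5, 0.7, 0.9],
})

# canvas.line(df, 'x', 'y') then draws the two curves without connecting
# the last point of the first curve to the first point of the second.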

@StevenCHowell (Author) commented Feb 21, 2017

@jbednar can you provide an example of what you are describing, perhaps using the example sine data I have above? My implementation was patterned after the time series notebook you have on Anaconda Cloud. Here is the core bit of code I am timing:

    # (excerpt from the timeseries notebook: ds = datashader, xr = xarray,
    #  tf = datashader.transfer_functions)
    cvs = ds.Canvas(x_range=x_range, y_range=y_range,
                    plot_height=h, plot_width=w)
    aggs = OrderedDict((c, cvs.line(df, 'Time', c)) for c in cols)
    merged = xr.concat(aggs.values(), dim=pd.Index(cols, name='cols'))
    total = merged.sum(dim='cols')
    img = tf.shade(total, how='linear')

That was the best example I found of plotting many lines.

@jbednar (Member) commented Feb 21, 2017

See the OpenSky example, which has about 200,000 separate trajectories. The timeseries notebook focuses on a few curves with very many samples each, so it's maybe not the best starting point for this case.

@StevenCHowell (Author)

In the OpenSky example, it is not clear how a single flight trajectory is stored in the flightpaths DataFrame. I see there are longitude and latitude coordinates, but how are the different flights separated? Using the ascending column?

Are these simply straight lines between two points? The city specific plots make it look like that is not the case.

I am not sure how I would reformat many lines from a 2 column text file into a comparable DataFrame. I have tens of thousands of data files. As they all have the same x-points, I currently load all the y-values into a NumPy array. I could just as easily put each set of y-values into a column in a DataFrame, but that is nothing like what the OpenSky DataFrame looks like.

It almost looks like it was drawn by connecting subsequent longitude-latitude points, but this would draw lines between where one plane lands and the next takes off (unless it is all the same plane, which does not seem likely). Mimicking this with thousands of sine curves, I would end up with lines from the last point of one curve to the first point of the next curve. How do you avoid this?

@jbednar (Member) commented Feb 22, 2017

Plotting line segments, time series, and trajectories all use the same underlying Line glyph support. Fundamentally, datashader accepts a dataframe where each row represents one point along a multiline, and it will then connect that point to the point provided in the subsequent row.

You are correct that doing so would falsely connect between unrelated lines, but as I briefly referred to above, whenever a row has a NaN in the X or Y (or usually both), this separator is treated as a break between lines, and the preceding and subsequent points are not connected.

So, if you have 5 curves of 10 points each, then the dataframe would store the 10 (x,y) pairs from the first curve, then (NaN,NaN), then the 10 (x,y) pairs from the second curve, and so on, ending up with 54 or 55 rows in the dataframe, not 50.

Creating such a data structure from numpy arrays shouldn't be very difficult:

x = np.fromfile("xfile")       # shared x values
dfs = []

for f in files:                # one file per curve
    y = np.fromfile(f)
    df = pd.DataFrame({"x": x, "y": y})
    dfs += [df, pd.DataFrame({"x": [np.nan]})]   # NaN row separates curves

df = pd.concat(dfs, ignore_index=True)

@jbednar (Member) commented Feb 22, 2017

(But note that the released version of datashader has a bug in the line support, skipping over the last or first point (can't remember which), so you should use the github master until we can do a new release.)

@StevenCHowell (Author) commented Feb 23, 2017

This is MUCH faster! Here are the timing comparisons.

[figure: ds_lines]

This prompts me to ask: why not use this method of rearranging the data for the time-series data? For any number of data sets, reformatting into two columns is orders of magnitude faster (~4000x faster for 8 data sets, close to the 10 used in the time-series data).

I also wonder what it would take to apply the faster two-column approach to data stored as many separate columns.

Here is the code to reformat the above data and apply the OpenSky method (with the timing bits).

dfs = []
split = pd.DataFrame({'x': [np.nan]})
for i in range(len(data)-1):
    x = data[0]
    y = data[i+1]
    df = pd.DataFrame({'x': x, 'y': y})
    dfs.append(df)
    dfs.append(split)
    
df = pd.concat(dfs, ignore_index=True)   

nx = len(x) + 1   # points per curve, plus the NaN separator row
run_times2 = []
imgs2 = []
test = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
for n in test:
    tic = time.time()
    canvas = datashader.Canvas(x_range=x_range, y_range=y_range,
                               plot_height=300, plot_width=300)
    # use only the rows belonging to the first n curves
    agg = canvas.line(df.iloc[:n*nx], 'x', 'y', datashader.count())
    img = datashader.transfer_functions.shade(agg, how='eq_hist')
    toc = time.time() - tic
    run_times2.append(toc)
    imgs2.append(img)

@StevenCHowell (Author) commented Feb 23, 2017

This provides a complete example (without the timing bits). Perhaps this could go in the gallery examples (#288):

import pandas as pd
import numpy as np
import datashader

# create some data worth plotting
nx = 50
x = np.linspace(0, np.pi * 2, nx)
y = np.sin(x)
n = 10000
data = np.empty([n+1, len(y)])
data[0] = x
prng = np.random.RandomState(123)

# offset each curve vertically by a random amount
offset = prng.normal(0, 0.1, n).reshape(n, -1)
data[1:] = y
data[1:] += offset

# make a few of the curves noisier
n_noisy = prng.randint(0, n, 5)
for i in n_noisy:
    data[i+1] += prng.normal(0, 0.5, nx)

# reformat the data into an appropriate DataFrame
dfs = []
split = pd.DataFrame({'x': [np.nan]})
for i in range(len(data)-1):
    x = data[0]
    y = data[i+1]
    df = pd.DataFrame({'x': x, 'y': y})
    dfs.append(df)
    dfs.append(split)
df = pd.concat(dfs, ignore_index=True)   

x_range = data[0].min(), data[0].max()
y_range = data[1:].min(), data[1:].max()

# actually make the plot
canvas = datashader.Canvas(x_range=x_range, y_range=y_range, 
                           plot_height=300, plot_width=300)
agg = canvas.line(df, 'x', 'y', datashader.count())
img = datashader.transfer_functions.shade(agg, how='eq_hist')
img

[figure: example]

@StevenCHowell (Author) commented Feb 23, 2017

While it remains awkward to plot NumPy arrays, I think the second method resolves the speed issue. In connection with issue #283, it would be nice to make the input method for plotting lines from NumPy arrays more intuitive, which would remove the awkward DataFrame generation. For now, the two-column DataFrame is a functional option.

@jbednar changed the title from "Plotting many lines takes a long time" to "Plotting many lines as separate columns takes a long time" on Feb 23, 2017
@narendramukherjee (Contributor) commented Oct 5, 2017

Thanks @StevenCHowell for this helpful discussion. Also thanks to @jbednar for suggesting that we use datashader in response to our paper at SciPy 2017 (http://conference.scipy.org/proceedings/scipy2017/narendra_mukherjee.html). I was wondering what the current state of affairs is for plotting multiple sequences stored in a numpy array with datashader. Although the workaround described above works very well on the memory side of things for me (it reduces memory use by about a factor of 4 compared to plainly overlaying the sequences on top of each other with matplotlib), creating the pandas dataframe takes a significant amount of time compared to everything else. Here I am plotting ~65000 time series, each with 45 time points; here are some rough timing numbers:

In [114]: data.shape
Out[114]: (65539, 45)

In [115]: %timeit %paste
dfs = []
split = pd.DataFrame({'x': [np.nan]})
for i in range(len(data)-1):
    x = data[0]
    y = data[i+1]
    df = pd.DataFrame({'x': x, 'y': y})
    dfs.append(df)
    dfs.append(split)

df = pd.concat(dfs, ignore_index=True)

## -- End pasted text --
1 loop, best of 3: 53.4 s per loop

In [116]: %timeit canvas = datashader.Canvas(x_range=(0, 45), y_range=(df['y'].min() - 10, df['y'].max() + 10), plot_height=600, plot_width=800)
10 loops, best of 3: 37.4 ms per loop

In [117]: %timeit agg = canvas.line(df, 'x', 'y', datashader.count())
1 loop, best of 3: 3.03 s per loop

Creating the pandas dataframe is by far the slowest step of the process; it takes almost a minute. While that is not too significant for plotting just one set of 65000 curves, I usually have to plot 40-50 of these sets in one go, which obviously takes a lot of time :|

@jbednar (Member) commented Oct 5, 2017

Hmm. We too are a bit frustrated here, but I don't know the best way forward. The NaN-separated format is fast to plot, but it is very expensive to create, and not very usable once it's created. E.g. there's no obvious way to index into it to select or filter individual polylines.

We could support directly iterating over a multidimensional numpy or xarray array of polylines, but unlike the NaN-separated format, that approach would be limited to equal-length lines, which is only one of many common cases. For instance, when rendering network graphs, we want to plot a very large number of short polylines of variable lengths.

It would be great if there were a very efficient ragged array format that fit all of these requirements, i.e. something indexable, quickly iterable in Numba code, and memory efficient. I'm pretty sure there isn't one readily available in Python, but as fixed-length arrays are an important special case for graph rendering (directly connected nodes), maybe we should just bite the bullet and support the special case of large NumPy/xarray arrays of fixed-length polylines?
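
(For concreteness, one common ragged-array layout, shown here purely as an illustration and not as something datashader currently consumes, is a flat buffer of points plus an offsets array marking where each polyline starts:)

import numpy as np

# All points concatenated; line i spans offsets[i]:offsets[i+1].
xs = np.array([0., 1., 2.,   0., 1.,   0., 1., 2., 3.])
ys = np.array([0., 1., 0.,   2., 2.,   1., 0., 1., 0.])
offsets = np.array([0, 3, 5, 9])

for i in range(len(offsets) - 1):
    sl = slice(offsets[i], offsets[i + 1])
    print("line %d:" % i, list(zip(xs[sl], ys[sl])))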

@philippjfr, what do you think?

@philippjfr (Member)

Since it doesn't seem like we are going to solve the ragged-array issue any time soon, I'd be in favor of adding some handling to support fixed-length arrays in the short term. What I'm not clear on is which format we'd want to support: the most obvious options are a dataframe with an index for the x-values and N columns for the y-values, or simply an array where the first column holds the x-values and the remaining columns the y-values. The dataframe format seems to make more sense, since that's what we support already.

@narendramukherjee (Contributor) commented Oct 6, 2017

Sorry for the confusion @jbednar and @philippjfr - I think the dataframe appending step in the solution posted above was the slow step, and I just copied it lazily. Here's a much faster implementation that avoids the appending step with some simple Numpy operations:

In [43]: data.shape
Out[43]: (65539, 45)

In [44]: new_waveforms = np.zeros((data.shape[0], data.shape[1] + 1))

In [45]: new_waveforms[:, -1] = np.nan

In [47]: new_waveforms[:, :-1] = data

In [48]: new_waveforms.shape
Out[48]: (65539, 46)

In [49]: x = np.zeros(46)

In [50]: x[-1] = np.nan

In [51]: x[:-1] = np.arange(45)

In [52]: x = np.tile(x, new_waveforms.shape[0])

In [53]: x.shape
Out[53]: (3014794,)

In [54]: %timeit df = pd.DataFrame({'x': x, 'y': new_waveforms.flatten()})
10 loops, best of 3: 21.5 ms per loop

That's a simple fix to the problem, at least for equal-length sequences: as long as the appending part is not done in pandas, the speed issue shouldn't be a problem. Maybe we could put this up as a cleaner example of how to deal with a large number of sequences?
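
For what it's worth, a generalized version of this snippet might look something like the following (just a sketch; the helper name to_nan_separated_df is made up and is not an existing datashader function):

import numpy as np
import pandas as pd

def to_nan_separated_df(ys, x=None):
    # ys: (n_lines, n_points) array, one equal-length curve per row.
    # x: shared x values (defaults to 0..n_points-1).
    n_lines, n_points = ys.shape
    if x is None:
        x = np.arange(n_points)
    # append a NaN to every curve so consecutive curves stay disconnected
    ys_padded = np.empty((n_lines, n_points + 1))
    ys_padded[:, :-1] = ys
    ys_padded[:, -1] = np.nan
    xs_padded = np.empty(n_points + 1)
    xs_padded[:-1] = x
    xs_padded[-1] = np.nan
    return pd.DataFrame({'x': np.tile(xs_padded, n_lines),
                         'y': ys_padded.ravel()})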

@jbednar (Member) commented Oct 6, 2017

We can certainly provide utility functions for converting common data structures that are used in practice into ones that datashader can deal with directly. I'd be happy to see the above generalized and put into a function in ds.utils, along with an example somewhere in the notebooks of how to use it (e.g. in tseries.ipynb). Any hope of you creating a PR for that? :-)

With that in place we can then determine if there's a big savings to be had by supporting a fixed-length array type directly.

@narendramukherjee (Contributor)

Cool, I will have a look at ds.utils and tseries.ipynb and put in a PR soon :)
