Bogus column and row at edge? #259

jbednar · 2016-11-04T23:16:32Z

There seems to be an empty column and row at the edge of aggregated arrays:

import pandas as pd
import numpy as np
import datashader as ds
import datashader.transfer_functions as tf

df = pd.DataFrame(dict(x=np.random.normal(0,10,10000),y=np.random.normal(0,10,10000)))

canvas = ds.Canvas(plot_width=8, plot_height=8, x_range=(-4,4), y_range=(-4,4))
agg = canvas.points(df, 'x', 'y')
agg.values

array([[21, 13, 21, 26, 17, 18, 20,  0],
       [22, 16, 28, 23, 17, 16, 23,  0],
       [17, 20, 22, 20, 20, 19, 26,  0],
       [16, 14, 28, 14, 12, 16, 22,  0],
       [16, 17, 16, 25, 19, 18, 18,  0],
       [ 9, 24, 24, 24, 17, 14, 17,  0],
       [19, 24, 17, 19, 13, 12, 21,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0]], dtype=int32)

The extra row and column ends up being transparent, at least for counts, so it has little effect, but it still seems like we should be returning a fully filled-out array.

The text was updated successfully, but these errors were encountered:

jbednar · 2016-11-06T12:43:16Z

After looking at the code, it looks to me like the handling of the last column has multiple issues. For example:

import pandas as pd
import numpy as np
import datashader as ds
import datashader.transfer_functions as tf

num=10000

df = pd.DataFrame(dict(x=[0.1, 0.5, 1.1, 1.5, 2.2, 2.9],
                       y=[0.1, 0.1, 0.1, 0.1, 0.1, 0.1]))

canvas = ds.Canvas(plot_width=3, plot_height=3,
x_range=(0,3), y_range=(0,3))
agg = canvas.points(df,'x','y')

Here there are six data points, all within the specified range, and I would expect them to fit evenly into the pixel array as follows:

  0  1  2  3
0 _____________
  |. .|. .|. .|
1 |---+---+---|
  |   |   |   |
2 |---+---+---|
  |___|___|___|
3

Instead the current code yields:

array([[3, 3, 0],
       [0, 0, 0],
       [0, 0, 0]], dtype=int32)

I.e., all six datapoints are being squeezed into only two of the pixels, which is difficult to justify from their coordinates. The correct answer can be obtained by removing a bogus "-1":

diff --git a/datashader/core.py b/datashader/core.py
index c1c3fcf..574aa9f 100644
@@ -61,7 +62,7 @@ class Axis(object):

         """
         start, end = map(self.mapper, range)
-        s = (n - 1)/(end - start)
+        s = n/(end - start)
         t = -start * s
         return s, t

yielding:

array([[2, 2, 2],
       [0, 0, 0],
       [0, 0, 0]], dtype=int32)

Great! However, there are then problems with the upper bound. According to the docstring for ds.core.Axis, the range is meant to be inclusive on both ends, and indeed if the point (3.0,0.1) were to be included in the dataframe above, the original code will count it in the last cell, yielding [3,3,1]. But with the change from n-1 to n, this new datapoint ends up very dangerously overwriting the next cell in memory:

array([[2, 2, 2],
       [1, 0, 0],
       [0, 0, 0]], dtype=int32)

Obviously, the result is then not just incorrect, but if that point is at the end of the array then segfaults are likely. To fix this, we can define the upper bound to be exclusive:

diff --git a/datashader/core.py b/datashader/core.py
index c1c3fcf..574aa9f 100644
--- a/datashader/core.py
+++ b/datashader/core.py
@@ -49,8 +49,9 @@ class Axis(object):
         Parameters
         ----------
         range : tuple
-            A tuple representing the boundary inclusive range ``[min, max]``
-            along the axis, in data space.
+            A tuple representing the range ``[min, max]`` along the axis, in data space.
+            min is inclusive and max is exclusive.
+            
         n : int
             The number of bins along the axis.

@@ -61,7 +62,7 @@ class Axis(object):

         """
         start, end = map(self.mapper, range)
-        s = (n - 1)/(end - start)
+        s = n/(end - start)
         t = -start * s
         return s, t

diff --git a/datashader/glyphs.py b/datashader/glyphs.py
index a61e66b..e765fa1 100644
--- a/datashader/glyphs.py
+++ b/datashader/glyphs.py
@@ -57,7 +57,7 @@ class Point(_PointLike):
             for i in range(xs.shape[0]):
                 x = xs[i]
                 y = ys[i]
-                if (xmin <= x <= xmax) and (ymin <= y <= ymax):
+                if (xmin <= x < xmax) and (ymin <= y < ymax):                
                     append(i,
                            int(x_mapper(x) * sx + tx),
                            int(y_mapper(y) * sy + ty),

But this is a change in semantics, and it may be surprising to users that points on the upper boundary are not included in the counts. Still, it seems like there would need to be quite a lot of extra code and checking to somehow ensure that points on the upper boundary are included if the mapping is done correctly, since no other cell boundary can act that way (or else points in interior cells would be counted twice).

See my HoloViews continuous coordinates tutorial for a detailed analysis of these issues; it's focusing on making a regular grid of samples from a continuous function, but basically the same approach applies to irregular samples from a continuous function as we have here.
For what it's worth, in the code used in HoloViews we decided many years ago that it was simpler and more consistent (both self-consistent and consistent with how arrays work in Python/Numpy) to make all upper boundaries exclusive.

If we do so here, we will then need to update the test data.

Fix holoviz#259.

jbednar assigned jbcrail Apr 11, 2017

jbednar mentioned this issue Apr 13, 2017

Performance comparison to vaex #310

Closed

jbcrail added a commit to jbcrail/datashader that referenced this issue Apr 24, 2017

Fix off-by-one error

68d51fe

Fix holoviz#259.

jbcrail mentioned this issue Apr 24, 2017

Fix bogus column and row at edge #318

Merged

jbcrail added this to the 0.5.0 milestone Apr 26, 2017

jbednar closed this as completed in #318 Apr 26, 2017

jlstevens unassigned jbcrail Sep 4, 2018

jbednar mentioned this issue Jan 5, 2019

Fix index-out-of-bounds line glyph crash and run select tests with JIT disabled #683

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bogus column and row at edge? #259

Bogus column and row at edge? #259

jbednar commented Nov 4, 2016

jbednar commented Nov 6, 2016

Bogus column and row at edge? #259

Bogus column and row at edge? #259

Comments

jbednar commented Nov 4, 2016

jbednar commented Nov 6, 2016