Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bogus column and row at edge? #259

Closed
jbednar opened this issue Nov 4, 2016 · 1 comment
Closed

Bogus column and row at edge? #259

jbednar opened this issue Nov 4, 2016 · 1 comment
Milestone

Comments

@jbednar
Copy link
Member

jbednar commented Nov 4, 2016

There seems to be an empty column and row at the edge of aggregated arrays:

import pandas as pd
import numpy as np
import datashader as ds
import datashader.transfer_functions as tf

df = pd.DataFrame(dict(x=np.random.normal(0,10,10000),y=np.random.normal(0,10,10000)))

canvas = ds.Canvas(plot_width=8, plot_height=8, x_range=(-4,4), y_range=(-4,4))
agg = canvas.points(df, 'x', 'y')
agg.values
array([[21, 13, 21, 26, 17, 18, 20,  0],
       [22, 16, 28, 23, 17, 16, 23,  0],
       [17, 20, 22, 20, 20, 19, 26,  0],
       [16, 14, 28, 14, 12, 16, 22,  0],
       [16, 17, 16, 25, 19, 18, 18,  0],
       [ 9, 24, 24, 24, 17, 14, 17,  0],
       [19, 24, 17, 19, 13, 12, 21,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0]], dtype=int32)

The extra row and column ends up being transparent, at least for counts, so it has little effect, but it still seems like we should be returning a fully filled-out array.

@jbednar
Copy link
Member Author

jbednar commented Nov 6, 2016

After looking at the code, it looks to me like the handling of the last column has multiple issues. For example:

import pandas as pd
import numpy as np
import datashader as ds
import datashader.transfer_functions as tf

num=10000

df = pd.DataFrame(dict(x=[0.1, 0.5, 1.1, 1.5, 2.2, 2.9],
                       y=[0.1, 0.1, 0.1, 0.1, 0.1, 0.1]))

canvas = ds.Canvas(plot_width=3, plot_height=3,
x_range=(0,3), y_range=(0,3))
agg = canvas.points(df,'x','y')

Here there are six data points, all within the specified range, and I would expect them to fit evenly into the pixel array as follows:

  0  1  2  3
0 _____________
  |. .|. .|. .|
1 |---+---+---|
  |   |   |   |
2 |---+---+---|
  |___|___|___|
3

Instead the current code yields:

array([[3, 3, 0],
       [0, 0, 0],
       [0, 0, 0]], dtype=int32)

I.e., all six datapoints are being squeezed into only two of the pixels, which is difficult to justify from their coordinates. The correct answer can be obtained by removing a bogus "-1":

diff --git a/datashader/core.py b/datashader/core.py
index c1c3fcf..574aa9f 100644
@@ -61,7 +62,7 @@ class Axis(object):

         """
         start, end = map(self.mapper, range)
-        s = (n - 1)/(end - start)
+        s = n/(end - start)
         t = -start * s
         return s, t

yielding:

array([[2, 2, 2],
       [0, 0, 0],
       [0, 0, 0]], dtype=int32)

Great! However, there are then problems with the upper bound. According to the docstring for ds.core.Axis, the range is meant to be inclusive on both ends, and indeed if the point (3.0,0.1) were to be included in the dataframe above, the original code will count it in the last cell, yielding [3,3,1]. But with the change from n-1 to n, this new datapoint ends up very dangerously overwriting the next cell in memory:

array([[2, 2, 2],
       [1, 0, 0],
       [0, 0, 0]], dtype=int32)

Obviously, the result is then not just incorrect, but if that point is at the end of the array then segfaults are likely. To fix this, we can define the upper bound to be exclusive:

diff --git a/datashader/core.py b/datashader/core.py
index c1c3fcf..574aa9f 100644
--- a/datashader/core.py
+++ b/datashader/core.py
@@ -49,8 +49,9 @@ class Axis(object):
         Parameters
         ----------
         range : tuple
-            A tuple representing the boundary inclusive range ``[min, max]``
-            along the axis, in data space.
+            A tuple representing the range ``[min, max]`` along the axis, in data space.
+            min is inclusive and max is exclusive.
+            
         n : int
             The number of bins along the axis.

@@ -61,7 +62,7 @@ class Axis(object):

         """
         start, end = map(self.mapper, range)
-        s = (n - 1)/(end - start)
+        s = n/(end - start)
         t = -start * s
         return s, t

diff --git a/datashader/glyphs.py b/datashader/glyphs.py
index a61e66b..e765fa1 100644
--- a/datashader/glyphs.py
+++ b/datashader/glyphs.py
@@ -57,7 +57,7 @@ class Point(_PointLike):
             for i in range(xs.shape[0]):
                 x = xs[i]
                 y = ys[i]
-                if (xmin <= x <= xmax) and (ymin <= y <= ymax):
+                if (xmin <= x < xmax) and (ymin <= y < ymax):                
                     append(i,
                            int(x_mapper(x) * sx + tx),
                            int(y_mapper(y) * sy + ty),

But this is a change in semantics, and it may be surprising to users that points on the upper boundary are not included in the counts. Still, it seems like there would need to be quite a lot of extra code and checking to somehow ensure that points on the upper boundary are included if the mapping is done correctly, since no other cell boundary can act that way (or else points in interior cells would be counted twice).

See my HoloViews continuous coordinates tutorial for a detailed analysis of these issues; it's focusing on making a regular grid of samples from a continuous function, but basically the same approach applies to irregular samples from a continuous function as we have here.
For what it's worth, in the code used in HoloViews we decided many years ago that it was simpler and more consistent (both self-consistent and consistent with how arrays work in Python/Numpy) to make all upper boundaries exclusive.

If we do so here, we will then need to update the test data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants