Consistent handling of NaNs in where reductions #1215
Comments
Thanks for raising this; it's important to be consistent about it!
For 1, it seems important to mention that the final result for a pixel will be NaN if no non-NaN values are found.
For 2, I guess the reason to return -1 instead of NaN is that the type is typically integer? I'm not sure if xarray now supports pandas' nullable types so that even integers can have a null value, but that sure would be convenient.
For 3, to make it concrete, let's consider a case where "value" is a magnitude of some sort, while "other" is an id. Then with option B, we'd get a NaN if the row with the max "value" is missing its id field, while for option A we'd get the id of the max "value" that does have an id. Is that right? If so, I can imagine cases for which either A or B is appropriate, i.e. do I want some datapoint that I can look up, as close to the max as possible (A), or do I want the true max, even if it doesn't have an id (B)? I think B is a safer default, though it might be useful to offer A for people who don't much care about the max in particular but just want some explorable ids.
Agreed.
Yes, -1 is our "no data" value. Row indexes are int64 and always >= 0, so I chose -1 as it is the easiest number to use outside of this range. Any pandas or xarray support for nullable integers doesn't help us: within datashader we have to use numpy arrays with scalar dtypes, as that is what is supported by numba.
Yes.
We could offer both via a keyword argument to a `where`. So if we pick one now, let's go for B as that is your preference and the safer default. It is also easier to explain and easier to implement. It will be a one-liner in user code to drop rows for which an explorable id does not exist and hence obtain option A. When we get round to documenting some examples of this functionality we can include a section about these options and how to obtain and use them.
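The one-liner mentioned above could look something like the following, assuming the data lives in a pandas `DataFrame` and `"other"` is a float column that uses `NaN` for missing ids. The column names and values here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "value" is a magnitude, "other" is an id that may be missing.
df = pd.DataFrame({
    "x": [0.1, 0.2, 0.3],
    "y": [0.1, 0.2, 0.3],
    "value": [1.0, 5.0, 3.0],
    "other": [10.0, np.nan, 30.0],
})

# Drop the rows that have no usable "other" before aggregating. An option B
# reduction then never sees them, which gives the option A result.
df_with_ids = df.dropna(subset=["other"])
```

`df_with_ids` would then be passed to the `Canvas` aggregating function in place of `df`.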
Now that we have support for `where` reductions (#1214 for Dask, CUDA to follow) there is a question of how to handle `NaN`s. We need to decide on a consistent approach.
1. Background
When `ds.max("value")` is used as an `agg` in a `Canvas` aggregating function (e.g. `Canvas.points`) the return is an array of pixels, each pixel containing the maximum value of the `"value"` column of the supplied `DataFrame` over all rows that contribute to that pixel. If a `DataFrame` row has a `"value"` of `NaN` then it is skipped so that it does not contribute to the aggregation.
2. `where` returning row index
This is the simplest example of the new `where` reduction. Each pixel of the returned array is the index of the row (integer `>= 0`) containing the maximum of the `"value"` column. A `"value"` of `NaN` is skipped as above so that row does not contribute to the returned array. Pixels for which there are no contributing rows contain `-1` to indicate no data.
3. `where` returning another column
This is the scenario we need to decide about.
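Before weighing that decision, the behaviour described in items 1 and 2 can be sketched for a single pixel in plain Python. This is only an illustration of the intended semantics, not datashader's implementation: `reduce_pixel` and its `(row_index, value)` input format are hypothetical, and keeping the first row on ties is an arbitrary choice made for this sketch.

```python
import math

def reduce_pixel(rows):
    """Reduce the rows that fall in one pixel.

    rows: iterable of (row_index, value) pairs, with row_index >= 0.
    Returns (max value, index of the row holding it).
    """
    best_value = math.nan  # item 1: NaN if no row contributes
    best_row = -1          # item 2: -1 if no row contributes
    for row_index, value in rows:
        if math.isnan(value):
            continue  # a NaN "value" never contributes
        if best_row == -1 or value > best_value:
            best_value = value
            best_row = row_index
    return best_value, best_row

# Row 7 has a NaN "value" and is skipped; row 9 holds the maximum.
value, row = reduce_pixel([(4, 1.0), (7, math.nan), (9, 3.0)])

# A pixel whose only row has a NaN "value" behaves like an empty pixel.
empty_value, empty_row = reduce_pixel([(2, math.nan)])
```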
Here both `"value"` and `"other"` are `DataFrame` columns. Let's assume they are both `float64` and may contain `NaN`s.
If `"value"` is `NaN` then consistent with item 2 above we skip the row. It doesn't matter what `"other"` is here.
If `"value"` is not `NaN` but `"other"` is, there are two possibilities for how to handle this.
Option A
If `"other"` is `NaN`, skip the row. Hence if either of `"value"` or `"other"` is `NaN` we skip the row. This needs look-ahead functionality to check what `"other"` is before we deal with `ds.max("value")`. The returned array can contain `NaN`s, which correspond to pixels for which there are no contributing rows.
Option B
If `"other"` is `NaN`, treat it in the same way as any `"other"` value. This therefore doesn't need the lookahead of `"other"`; we just blindly copy the `"other"` if it corresponds to the maximum `"value"`. The returned array can contain `NaN`s. These correspond either to pixels for which there are no contributing rows, or to pixels for which the row with the maximum non-`NaN` `"value"` has a `NaN` `"other"`.
It seems to me that option A is the most logical. But I can see an argument for option B.
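To make the difference between the two options concrete, here is a small plain-Python sketch for a single pixel. It is an assumption-laden illustration rather than the actual implementation: `where_max_other` and its `skip_nan_other` flag are hypothetical names, and each row is given as a `(value, other)` pair.

```python
import math

def where_max_other(rows, skip_nan_other):
    """Return the "other" of the row with the maximum "value" in one pixel.

    rows: iterable of (value, other) pairs.
    skip_nan_other: True gives option A, False gives option B.
    """
    best_value = math.nan
    best_other = math.nan  # NaN if no row contributes
    found = False
    for value, other in rows:
        if math.isnan(value):
            continue  # both options skip a NaN "value"
        if skip_nan_other and math.isnan(other):
            continue  # option A only: look ahead at "other" and skip the row
        if not found or value > best_value:
            best_value, best_other, found = value, other, True
    return best_other

# The row with the largest "value" (5.0) has a NaN "other".
pixel = [(1.0, 10.0), (5.0, math.nan), (3.0, 30.0), (math.nan, 40.0)]

option_a = where_max_other(pixel, skip_nan_other=True)   # 30.0
option_b = where_max_other(pixel, skip_nan_other=False)  # nan
```

Option A returns the `"other"` of the largest `"value"` that has one, while option B returns the `"other"` of the true maximum even when it is `NaN`.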