-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12964: [R] Add bindings for ifelse() and if_else() #10724
Conversation
Ok, I've pushed a few changes. The biggest one is that I've removed the We might want to have our own type checking, but it's not totally trivial with what we have now. We have some of the We don't support |
r/R/dplyr-functions.R
Outdated
# TODO: do this ^^^ | ||
|
||
if (inherits(true, "character") || inherits(false, "character")) { | ||
stop("`true` and `false` character values not yet supported in Arrow") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this might be left over from above, I'll take this out (or add why it needs to stay) when I rebase + fix conflicts as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it turns out only a limited set of types are supported right now ARROW-12955 has a PR to add other types. I've added some comments + tests around this and linked to that ticket as well. The types that are currently supported are: Boolean, Null, Numeric, Temporal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a bit more testing for this. It's a bit hacky using nse_funcs$is.character(...)
. I haven't dug too deeply into if we should allow is.character(var)
like was rejected here: https://issues.apache.org/jira/browse/ARROW-12781?focusedCommentId=17344170&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17344170 but even if we should, that's probably a separate ticket.
ee206d5
to
794c96b
Compare
r/R/dplyr-functions.R
Outdated
invalid_r_types <- is.character(true) || is.character(false) || is.list(true) || | ||
is.list(false) || is.factor(true) || is.factor(false) | ||
# However, if they are expressions, we need to use the functions from nse_funcs | ||
invalid_expression_types_true <- inherits(true, "Expression") && ( | ||
nse_funcs$is.character(true) || nse_funcs$is.list(true) || nse_funcs$is.factor(true) | ||
) | ||
invalid_expression_types_false <- inherits(false, "Expression") && ( | ||
nse_funcs$is.character(false) || nse_funcs$is.list(false) || nse_funcs$is.factor(false) | ||
) | ||
if (invalid_r_types | invalid_expression_types_true | invalid_expression_types_false) { | ||
stop("`true` and `false` character values not yet supported in Arrow", call. = FALSE) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do this at all? Why not let the C++ library tell us (by whether it succeeds or not) what it supports? This seems brittle and I don't want to have to maintain duplicated validation code if we don't have to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully all of this will go away once I rebase (see below) though, one benefit to having this validation + error proactively than wait for the kernel dispatching to error is that if we do it here it will abandon ship and pull into R and work that way. If we let the kernel error we only get the error (and additionally no indication that doing something like collect()
before this step would help with the situation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turns out we're not quite there yet — the last (common) type that's not yet implemented still is factors/dictionaries. Currently they are auto decoded into strings — which is better than nothing, but not quite right.
I can use the (fragile) is.factor()
approach here to warn that though we are getting factors, we are returning strings (for now) or we could silently do the conversation with no checking. Or, I guess we could also disable factors entirely, but that seems extreme. I vote that we warn so that there's no confusion about the change in type, even if it is a bit fragile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I'm looking at explicit dictionary support, what is the expectation for how dictionaries behave? Do we require that all inputs have the same exact dictionary, or should we merge dictionaries? It looks like base R/dplyr behave inconsistently here when the dictionaries differ:
> library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
> fct1 <- factor(c("a", "b"), levels = c("a", "b", "c"))
> fct2 <- factor(c("a", "d"), levels = c("a", "b", "d"))
> int <- c(10, 2)
> if_else(int > 5, fct1, fct2)
[1] a <NA>
Levels: a b c
Warning message:
In `[<-.factor`(`*tmp*`, i, value = 3L) :
invalid factor level, NA generated
> ifelse(int > 5, fct1, fct2)
[1] 1 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that the base R version here is wrong / never what someone actually wants.
I don't think we want to get into the business of coalescing/merging the dictionaries to be the same (there are many edge cases that can lead to very funny outcomes). But emulating the dplyr behavior seems reasonable here (use the levels of the first, merge the values together and any value that's not in the levels of the first gets an NA + warning that the dictionaries didn't match)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aaah, ok. Would it be possible to do something like this then: Use the levels of the first, merge the values together, and error on any value that's not in the levels of the first? (where we could redirect the person to either re-encode the dictionaries or NULL
the offending values)
IME it's not uncommon to have a circumstance where you filter down to rows that have values that overlap (even though the full dictionaries are different) and forcing someone to re-encode there when no offending value would ever be there could be a bit frustrating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's also reasonable - error only if we see a value we can't encode (and we could add a toggle to automatically null-encode it or just error).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds great. Cause the null-encode is basically the only other safe option and is also pretty common to want
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to also get this into the coalesce/select/if_else kernels when I get a chance (though I'll start with case_when since I'm already in there).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have a way to emit warnings
Only partially related but this reminded me to create ARROW-13566 which might help with these sorts of situations
3becca3
to
93f51a3
Compare
) %>% collect(), | ||
example_data_for_sorting | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This skip is not great, I could probably write up a hacky helper that basically checks to see if the condition is.na
and if it's not then only if the type is double do is.nan()
. And maybe I should do that already so that we don't silently differ from expectations here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://issues.apache.org/jira/browse/ARROW-12055 is the "make NaN
actually work" ticket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But dbl
doesn't have NaN
does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ARROW-12055 is now resolved by this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this skip now be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately as it turns out, no. I've created https://issues.apache.org/jira/browse/ARROW-13364 to track this, but Arrow's comparison with NaN
s results in false
and not an NA
(-like) value:
> example_data_for_sorting %>% mutate(
+ y = if_else(dbl > 5, chr, chr, missing = "MISSING")
+ ) %>% collect()
# A tibble: 10 x 7
int dbl chr lgl dttm grp y
<int> <dbl> <chr> <lgl> <dttm> <chr> <chr>
1 -2147483647 -Inf "" FALSE 0000-01-01 00:00:00 A ""
2 -101 -1.80e+308 "" FALSE 1919-05-29 13:08:55 A ""
3 -100 -2.23e-308 "\"" FALSE 1955-06-20 04:10:42 A "\""
4 0 0 "&" FALSE 1973-06-30 11:38:41 A "&"
5 0 2.23e-308 "ABC" TRUE 1987-03-29 12:49:47 A "ABC"
6 1 3.14e+ 0 "NULL" TRUE 1991-06-11 19:07:01 B "NULL"
7 100 1.80e+308 "a" TRUE NA B "a"
8 1000 Inf "abc" TRUE 2017-08-21 18:26:40 B "abc"
9 2147483647 NaN "zzz" TRUE 2017-08-21 18:26:40 B "MISSING"
10 NA NA NA NA 9999-12-31 23:59:59 B "MISSING"
> Table$create(example_data_for_sorting) %>% mutate(
+ y = if_else(dbl > 5, chr, chr, missing = "MISSING")
+ ) %>% collect()
# A tibble: 10 x 7
int dbl chr lgl dttm grp y
<int> <dbl> <chr> <lgl> <dttm> <chr> <chr>
1 -2147483647 -Inf "" FALSE 0000-01-01 00:00:00 A ""
2 -101 -1.80e+308 "" FALSE 1919-05-29 13:08:55 A ""
3 -100 -2.23e-308 "\"" FALSE 1955-06-20 04:10:42 A "\""
4 0 0 "&" FALSE 1973-06-30 11:38:41 A "&"
5 0 2.23e-308 "ABC" TRUE 1987-03-29 12:49:47 A "ABC"
6 1 3.14e+ 0 "NULL" TRUE 1991-06-11 19:07:01 B "NULL"
7 100 1.80e+308 "a" TRUE NA B "a"
8 1000 Inf "abc" TRUE 2017-08-21 18:26:40 B "abc"
9 2147483647 NaN "zzz" TRUE 2017-08-21 18:26:40 B "zzz"
10 NA NA NA NA 9999-12-31 23:59:59 B "MISSING"
That 9th row NaN > 5
is evaluated to NA
in R and therefore gets a missing value, where as in Arrow NaN > 5
evaluates to false
so we get the "zzz"
from the chr
column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gosh, missing data is hard 🤔
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of final notes but this generally is looking good, thanks for the work on it
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
…ARROW-12964_ifelse
This also makes the behavior of
is.na()
andis.nan()
more consistent with base R