Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12964: [R] Add bindings for ifelse() and if_else() #10724

Closed
wants to merge 26 commits into from

Conversation

thisisnic
Copy link
Member

@thisisnic thisisnic commented Jul 15, 2021

This also makes the behavior of is.na() and is.nan() more consistent with base R

@github-actions
Copy link

@thisisnic thisisnic marked this pull request as ready for review July 15, 2021 15:16
@jonkeane
Copy link
Member

Ok, I've pushed a few changes. The biggest one is that I've removed the assert_that call (since it wasn't quite what we wanted) and am not relying on the kernel dispatch to tell us if we have incompatible types.

We might want to have our own type checking, but it's not totally trivial with what we have now. We have some of the is.* methods after https://issues.apache.org/jira/browse/ARROW-12781, though for this what we really need is something like an extension of typeof() for expressions that return the type and then a function that compares those types + the R types and their arrow mappings to ensure that those are compatible. I think this is out of the scope for this PR (and might actually be deferrable until https://issues.apache.org/jira/browse/ARROW-13186 is done or possibly forever).

We don't support ifelse's autocasting abilities — I'm hesitant to even try since it's not a particularly stable or good behavior, though I wish there was a way we could message or warn explaining that / why we did that.

# TODO: do this ^^^

if (inherits(true, "character") || inherits(false, "character")) {
stop("`true` and `false` character values not yet supported in Arrow")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this might be left over from above, I'll take this out (or add why it needs to stay) when I rebase + fix conflicts as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it turns out only a limited set of types are supported right now ARROW-12955 has a PR to add other types. I've added some comments + tests around this and linked to that ticket as well. The types that are currently supported are: Boolean, Null, Numeric, Temporal

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a bit more testing for this. It's a bit hacky using nse_funcs$is.character(...). I haven't dug too deeply into if we should allow is.character(var) like was rejected here: https://issues.apache.org/jira/browse/ARROW-12781?focusedCommentId=17344170&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17344170 but even if we should, that's probably a separate ticket.

@jonkeane jonkeane force-pushed the ARROW-12964_ifelse branch from ee206d5 to 794c96b Compare July 15, 2021 20:22
@jonkeane jonkeane requested a review from nealrichardson July 15, 2021 20:47
Comment on lines 666 to 669
invalid_r_types <- is.character(true) || is.character(false) || is.list(true) ||
is.list(false) || is.factor(true) || is.factor(false)
# However, if they are expressions, we need to use the functions from nse_funcs
invalid_expression_types_true <- inherits(true, "Expression") && (
nse_funcs$is.character(true) || nse_funcs$is.list(true) || nse_funcs$is.factor(true)
)
invalid_expression_types_false <- inherits(false, "Expression") && (
nse_funcs$is.character(false) || nse_funcs$is.list(false) || nse_funcs$is.factor(false)
)
if (invalid_r_types | invalid_expression_types_true | invalid_expression_types_false) {
stop("`true` and `false` character values not yet supported in Arrow", call. = FALSE)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do this at all? Why not let the C++ library tell us (by whether it succeeds or not) what it supports? This seems brittle and I don't want to have to maintain duplicated validation code if we don't have to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully all of this will go away once I rebase (see below) though, one benefit to having this validation + error proactively than wait for the kernel dispatching to error is that if we do it here it will abandon ship and pull into R and work that way. If we let the kernel error we only get the error (and additionally no indication that doing something like collect() before this step would help with the situation)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out we're not quite there yet — the last (common) type that's not yet implemented still is factors/dictionaries. Currently they are auto decoded into strings — which is better than nothing, but not quite right.

I can use the (fragile) is.factor() approach here to warn that though we are getting factors, we are returning strings (for now) or we could silently do the conversation with no checking. Or, I guess we could also disable factors entirely, but that seems extreme. I vote that we warn so that there's no confusion about the change in type, even if it is a bit fragile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I'm looking at explicit dictionary support, what is the expectation for how dictionaries behave? Do we require that all inputs have the same exact dictionary, or should we merge dictionaries? It looks like base R/dplyr behave inconsistently here when the dictionaries differ:

> library(dplyr)

Attaching package:dplyrThe following objects are masked frompackage:stats:

    filter, lag

The following objects are masked frompackage:base:

    intersect, setdiff, setequal, union

> fct1 <- factor(c("a", "b"), levels = c("a", "b", "c"))
> fct2 <- factor(c("a", "d"), levels = c("a", "b", "d"))
> int <- c(10, 2)
> if_else(int > 5, fct1, fct2)
[1] a    <NA>
Levels: a b c
Warning message:
In `[<-.factor`(`*tmp*`, i, value = 3L) :
  invalid factor level, NA generated
> ifelse(int > 5, fct1, fct2)
[1] 1 3

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that the base R version here is wrong / never what someone actually wants.

I don't think we want to get into the business of coalescing/merging the dictionaries to be the same (there are many edge cases that can lead to very funny outcomes). But emulating the dplyr behavior seems reasonable here (use the levels of the first, merge the values together and any value that's not in the levels of the first gets an NA + warning that the dictionaries didn't match)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaah, ok. Would it be possible to do something like this then: Use the levels of the first, merge the values together, and error on any value that's not in the levels of the first? (where we could redirect the person to either re-encode the dictionaries or NULL the offending values)

IME it's not uncommon to have a circumstance where you filter down to rows that have values that overlap (even though the full dictionaries are different) and forcing someone to re-encode there when no offending value would ever be there could be a bit frustrating.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's also reasonable - error only if we see a value we can't encode (and we could add a toggle to automatically null-encode it or just error).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds great. Cause the null-encode is basically the only other safe option and is also pretty common to want

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to also get this into the coalesce/select/if_else kernels when I get a chance (though I'll start with case_when since I'm already in there).

Copy link
Member

@westonpace westonpace Aug 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a way to emit warnings

Only partially related but this reminded me to create ARROW-13566 which might help with these sorts of situations

@jonkeane jonkeane force-pushed the ARROW-12964_ifelse branch from 3becca3 to 93f51a3 Compare July 16, 2021 15:33
@jonkeane jonkeane requested a review from nealrichardson July 16, 2021 15:33
) %>% collect(),
example_data_for_sorting
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This skip is not great, I could probably write up a hacky helper that basically checks to see if the condition is.na and if it's not then only if the type is double do is.nan(). And maybe I should do that already so that we don't silently differ from expectations here...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://issues.apache.org/jira/browse/ARROW-12055 is the "make NaN actually work" ticket

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But dbl doesn't have NaN does it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARROW-12055 is now resolved by this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this skip now be removed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately as it turns out, no. I've created https://issues.apache.org/jira/browse/ARROW-13364 to track this, but Arrow's comparison with NaNs results in false and not an NA(-like) value:

> example_data_for_sorting %>% mutate(
+     y = if_else(dbl > 5, chr, chr, missing = "MISSING")
+ ) %>% collect()
# A tibble: 10 x 7
           int          dbl chr    lgl   dttm                grp   y        
         <int>        <dbl> <chr>  <lgl> <dttm>              <chr> <chr>    
 1 -2147483647 -Inf         ""     FALSE 0000-01-01 00:00:00 A     ""       
 2        -101   -1.80e+308 ""     FALSE 1919-05-29 13:08:55 A     ""       
 3        -100   -2.23e-308 "\""   FALSE 1955-06-20 04:10:42 A     "\""     
 4           0    0         "&"    FALSE 1973-06-30 11:38:41 A     "&"      
 5           0    2.23e-308 "ABC"  TRUE  1987-03-29 12:49:47 A     "ABC"    
 6           1    3.14e+  0 "NULL" TRUE  1991-06-11 19:07:01 B     "NULL"   
 7         100    1.80e+308 "a"    TRUE  NA                  B     "a"      
 8        1000  Inf         "abc"  TRUE  2017-08-21 18:26:40 B     "abc"    
 9  2147483647  NaN         "zzz"  TRUE  2017-08-21 18:26:40 B     "MISSING"
10          NA   NA          NA    NA    9999-12-31 23:59:59 B     "MISSING"
> Table$create(example_data_for_sorting) %>% mutate(
+     y = if_else(dbl > 5, chr, chr, missing = "MISSING")
+ ) %>% collect()
# A tibble: 10 x 7
           int          dbl chr    lgl   dttm                grp   y        
         <int>        <dbl> <chr>  <lgl> <dttm>              <chr> <chr>    
 1 -2147483647 -Inf         ""     FALSE 0000-01-01 00:00:00 A     ""       
 2        -101   -1.80e+308 ""     FALSE 1919-05-29 13:08:55 A     ""       
 3        -100   -2.23e-308 "\""   FALSE 1955-06-20 04:10:42 A     "\""     
 4           0    0         "&"    FALSE 1973-06-30 11:38:41 A     "&"      
 5           0    2.23e-308 "ABC"  TRUE  1987-03-29 12:49:47 A     "ABC"    
 6           1    3.14e+  0 "NULL" TRUE  1991-06-11 19:07:01 B     "NULL"   
 7         100    1.80e+308 "a"    TRUE  NA                  B     "a"      
 8        1000  Inf         "abc"  TRUE  2017-08-21 18:26:40 B     "abc"    
 9  2147483647  NaN         "zzz"  TRUE  2017-08-21 18:26:40 B     "zzz"    
10          NA   NA          NA    NA    9999-12-31 23:59:59 B     "MISSING"

That 9th row NaN > 5 is evaluated to NA in R and therefore gets a missing value, where as in Arrow NaN > 5 evaluates to false so we get the "zzz" from the chr column

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gosh, missing data is hard 🤔

ianmcook and others added 5 commits July 16, 2021 15:29
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of final notes but this generally is looking good, thanks for the work on it

jonkeane and others added 4 commits July 17, 2021 09:46
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants