ARROW-12964: [R] Add bindings for ifelse() and if_else() #10724

thisisnic · 2021-07-15T14:23:21Z

This also makes the behavior of is.na() and is.nan() more consistent with base R

github-actions · 2021-07-15T14:23:41Z

https://issues.apache.org/jira/browse/ARROW-12964

jonkeane · 2021-07-15T17:12:14Z

Ok, I've pushed a few changes. The biggest one is that I've removed the assert_that call (since it wasn't quite what we wanted) and am now relying on the kernel dispatch to tell us if we have incompatible types.

We might want to have our own type checking, but it's not totally trivial with what we have now. We have some of the is.* methods after https://issues.apache.org/jira/browse/ARROW-12781, though what we really need for this is something like an extension of typeof() for expressions that returns the type, plus a function that compares those types with the R types and their Arrow mappings to ensure they are compatible. I think this is out of scope for this PR (and might actually be deferrable until https://issues.apache.org/jira/browse/ARROW-13186 is done, or possibly forever).
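To make the idea concrete, here is a minimal base-R sketch of that kind of type check. Everything in it is hypothetical and for illustration only: `fake_expression()`, `arrow_type_of()`, `types_compatible()` and the small mapping table are not part of the arrow package, and a real version would need to cover far more types.

```r
# Hypothetical lookup from R storage types to Arrow type names (illustrative,
# deliberately incomplete).
r_to_arrow_type <- c(
  logical = "bool",
  integer = "int32",
  double = "double",
  character = "string"
)

# Hypothetical stand-in for an Arrow Expression that knows its own type.
fake_expression <- function(type) {
  structure(list(type = type), class = "fake_expression")
}

# Sketch of a "typeof() for expressions": works on fake expressions and on
# plain R values.
arrow_type_of <- function(x) {
  if (inherits(x, "fake_expression")) {
    return(x$type)
  }
  if (is.factor(x)) {
    return("dictionary")
  }
  mapped <- r_to_arrow_type[typeof(x)]
  if (is.na(mapped)) {
    stop("No Arrow mapping for R type: ", typeof(x), call. = FALSE)
  }
  unname(mapped)
}

# Two inputs are compatible here only if they map to the same Arrow type.
types_compatible <- function(true, false) {
  identical(arrow_type_of(true), arrow_type_of(false))
}

types_compatible(1L, fake_expression("int32"))
types_compatible(1.5, fake_expression("int32"))
```

The strict `identical()` comparison is a design choice for the sketch: it rejects anything that would need a cast, which matches the decision here not to emulate autocasting.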

We don't support ifelse's autocasting abilities. I'm hesitant to even try since it's not a particularly stable or good behavior, though I wish there were a way we could message or warn to explain that we don't, and why.
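For readers unfamiliar with why the autocasting is considered unstable, a short base-R illustration (no Arrow involved): the type of the result depends on which branches happen to be selected, and attributes such as the Date class are dropped.

```r
# Only the `yes` branch is selected, so the result stays integer.
only_yes <- ifelse(TRUE, 1L, "a")
class(only_yes)   # "integer"

# Both branches are selected, so everything is coerced to character.
both <- ifelse(c(TRUE, FALSE), 1L, "a")
class(both)       # "character"

# Attributes are not carried over: Dates come back as plain numbers.
dates <- ifelse(c(TRUE, FALSE), as.Date("2021-07-15"), as.Date("2021-07-16"))
class(dates)      # "numeric"
```

The same call can therefore return different types for different data, which is hard to reproduce in a system that needs to know the output type up front.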

r/tests/testthat/test-dplyr.R

nealrichardson · 2021-07-15T18:19:49Z

r/R/dplyr-functions.R

+  # TODO: do this ^^^
+
+  if (inherits(true, "character") || inherits(false, "character")) {
+    stop("`true` and `false` character values not yet supported in Arrow")


Hmm, this might be left over from above. I'll take it out (or add a note on why it needs to stay) when I rebase and fix conflicts.

It turns out only a limited set of types is supported right now; ARROW-12955 has a PR to add other types. I've added some comments + tests around this and linked to that ticket as well. The types that are currently supported are: Boolean, Null, Numeric, Temporal.

I've added a bit more testing for this. It's a bit hacky using nse_funcs$is.character(...). I haven't dug too deeply into whether we should allow is.character(var), which was rejected here: https://issues.apache.org/jira/browse/ARROW-12781?focusedCommentId=17344170&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17344170 but even if we should, that's probably a separate ticket.

r/R/dplyr-functions.R

nealrichardson · 2021-07-16T14:41:20Z

r/R/dplyr-functions.R

+  invalid_r_types <- is.character(true) || is.character(false) || is.list(true) ||
+    is.list(false) || is.factor(true) || is.factor(false)
+  # However, if they are expressions, we need to use the functions from nse_funcs
+  invalid_expression_types_true <- inherits(true, "Expression") && (
+    nse_funcs$is.character(true) || nse_funcs$is.list(true) || nse_funcs$is.factor(true)
+  )
+  invalid_expression_types_false <- inherits(false, "Expression") && (
+    nse_funcs$is.character(false) || nse_funcs$is.list(false) || nse_funcs$is.factor(false)
+  )
+  if (invalid_r_types | invalid_expression_types_true | invalid_expression_types_false) {
+    stop("`true` and `false` character values not yet supported in Arrow", call. = FALSE)
+  }


Why do this at all? Why not let the C++ library tell us (by whether it succeeds or not) what it supports? This seems brittle and I don't want to have to maintain duplicated validation code if we don't have to.

Hopefully all of this will go away once I rebase (see below). That said, one benefit of validating and erroring proactively here, rather than waiting for the kernel dispatch to error, is that the query will abandon ship, pull the data into R, and work that way. If we let the kernel error, we only get the error (with no indication that doing something like collect() before this step would help).
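The trade-off can be sketched in plain R. This is a toy model under stated assumptions, not the arrow package's actual machinery: `bind_if_else()` and `mutate_sketch()` are hypothetical names, the "kernel" is just a deferred closure, and character inputs stand in for any unsupported type.

```r
# Hypothetical binding step: returns a deferred computation. With
# validate = TRUE it rejects unsupported types up front, at bind time.
bind_if_else <- function(true, false, validate = TRUE) {
  unsupported <- is.character(true) || is.character(false)
  if (validate && unsupported) {
    stop("`true` and `false` character values not yet supported", call. = FALSE)
  }
  function() {
    # Stand-in for kernel dispatch, which only fails when actually evaluated.
    if (unsupported) {
      stop("kernel dispatch failed: no kernel matching input types", call. = FALSE)
    }
    "computed in Arrow"
  }
}

# Hypothetical caller: bind-time errors can be caught and turned into a
# fallback, while errors raised during evaluation surface to the user as-is.
mutate_sketch <- function(true, false, validate) {
  deferred <- tryCatch(
    bind_if_else(true, false, validate = validate),
    error = function(e) NULL
  )
  if (is.null(deferred)) {
    warning("Expression not supported in Arrow; pulling data into R", call. = FALSE)
    return("computed in R")
  }
  deferred()
}

# Proactive validation: falls back to R with a warning.
fallback <- suppressWarnings(mutate_sketch("a", "b", validate = TRUE))

# No validation: the only outcome is the kernel's error.
kernel_only <- try(mutate_sketch("a", "b", validate = FALSE), silent = TRUE)
```

The cost of the proactive path is the duplicated validation logic raised in the review comment above; the sketch only shows what that duplication buys.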

Turns out we're not quite there yet: the last (common) type that's still not implemented is factors/dictionaries. Currently they are auto-decoded into strings, which is better than nothing, but not quite right.

I can use the (fragile) is.factor() approach here to warn that, although we are getting factors, we are returning strings (for now), or we could silently do the conversion with no checking. Or, I guess we could also disable factors entirely, but that seems extreme. I vote that we warn so that there's no confusion about the change in type, even if it is a bit fragile.

Now that I'm looking at explicit dictionary support, what is the expectation for how dictionaries behave? Do we require that all inputs have the same exact dictionary, or should we merge dictionaries? It looks like base R/dplyr behave inconsistently here when the dictionaries differ:

> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> fct1 <- factor(c("a", "b"), levels = c("a", "b", "c"))
> fct2 <- factor(c("a", "d"), levels = c("a", "b", "d"))
> int <- c(10, 2)
> if_else(int > 5, fct1, fct2)
[1] a    <NA>
Levels: a b c
Warning message:
In `[<-.factor`(`*tmp*`, i, value = 3L) :
  invalid factor level, NA generated
> ifelse(int > 5, fct1, fct2)
[1] 1 3

I would say that the base R version here is wrong / never what someone actually wants.

I don't think we want to get into the business of coalescing/merging the dictionaries to be the same (there are many edge cases that can lead to very funny outcomes). But emulating the dplyr behavior seems reasonable here (use the levels of the first, merge the values together and any value that's not in the levels of the first gets an NA + warning that the dictionaries didn't match)

Aaah, ok. Would it be possible to do something like this then: Use the levels of the first, merge the values together, and error on any value that's not in the levels of the first? (where we could redirect the person to either re-encode the dictionaries or NULL the offending values)

IME it's not uncommon to have a circumstance where you filter down to rows that have values that overlap (even though the full dictionaries are different) and forcing someone to re-encode there when no offending value would ever be there could be a bit frustrating.

Yeah, that's also reasonable - error only if we see a value we can't encode (and we could add a toggle to automatically null-encode it or just error).

That sounds great. Cause the null-encode is basically the only other safe option and is also pretty common to want
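The behavior agreed on above (use the levels of the first input, merge the values, and either error or null-encode any selected value that cannot be encoded) can be sketched in base R. This is only an illustration of the semantics: `if_else_first_levels()` and its `on_unencodable` argument are hypothetical names, not the Arrow kernel or dplyr's implementation.

```r
# Hypothetical helper illustrating the proposed dictionary semantics.
if_else_first_levels <- function(condition, true, false,
                                 on_unencodable = c("error", "null")) {
  on_unencodable <- match.arg(on_unencodable)
  stopifnot(is.logical(condition), is.factor(true), is.factor(false))

  # Merge the values by label rather than by integer code.
  picked <- ifelse(condition, as.character(true), as.character(false))

  # Only values that were actually selected can be unencodable.
  unencodable <- !is.na(picked) & !(picked %in% levels(true))
  if (any(unencodable)) {
    if (on_unencodable == "error") {
      stop(
        "Values not in the levels of `true`: ",
        paste(unique(picked[unencodable]), collapse = ", "),
        call. = FALSE
      )
    }
    picked[unencodable] <- NA_character_
  }
  factor(picked, levels = levels(true))
}

fct1 <- factor(c("a", "b"), levels = c("a", "b", "c"))
fct2 <- factor(c("a", "d"), levels = c("a", "b", "d"))

# The dictionaries differ, but no offending value is selected: no error.
overlap <- if_else_first_levels(c(TRUE, TRUE), fct1, fct2)

# "d" is selected and is not a level of fct1: null-encode it on request.
nulled <- if_else_first_levels(c(TRUE, FALSE), fct1, fct2, on_unencodable = "null")

# With the default, the same call errors instead.
errored <- try(if_else_first_levels(c(TRUE, FALSE), fct1, fct2), silent = TRUE)
```

Checking only the selected values is what avoids forcing a re-encode in the filtered-down case described above.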

I'll try to also get this into the coalesce/select/if_else kernels when I get a chance (though I'll start with case_when since I'm already in there).

We don't have a way to emit warnings

Only partially related but this reminded me to create ARROW-13566 which might help with these sorts of situations


jonkeane · 2021-07-16T17:06:01Z

r/tests/testthat/test-dplyr.R

+      ) %>% collect(),
+    example_data_for_sorting
+  )


This skip is not great. I could probably write up a hacky helper that first checks whether the condition is.na() and, if it isn't, calls is.nan() only when the type is double. And maybe I should do that already so that we don't silently differ from expectations here...

https://issues.apache.org/jira/browse/ARROW-12055 is the "make NaN actually work" ticket

But dbl doesn't have NaN does it?

ARROW-12055 is now resolved by this PR

Can this skip now be removed?

~~Unfortunately~~ as it turns out, no. I've created https://issues.apache.org/jira/browse/ARROW-13364 to track this, but Arrow's comparison with NaNs results in false and not an NA(-like) value:

> example_data_for_sorting %>% mutate(
+   y = if_else(dbl > 5, chr, chr, missing = "MISSING")
+ ) %>% collect()
# A tibble: 10 x 7
           int        dbl chr    lgl   dttm                grp   y
         <int>      <dbl> <chr>  <lgl> <dttm>              <chr> <chr>
 1 -2147483647 -Inf       ""     FALSE 0000-01-01 00:00:00 A     ""
 2        -101 -1.80e+308 ""     FALSE 1919-05-29 13:08:55 A     ""
 3        -100 -2.23e-308 "\""   FALSE 1955-06-20 04:10:42 A     "\""
 4           0  0         "&"    FALSE 1973-06-30 11:38:41 A     "&"
 5           0  2.23e-308 "ABC"  TRUE  1987-03-29 12:49:47 A     "ABC"
 6           1  3.14e+  0 "NULL" TRUE  1991-06-11 19:07:01 B     "NULL"
 7         100  1.80e+308 "a"    TRUE  NA                  B     "a"
 8        1000  Inf       "abc"  TRUE  2017-08-21 18:26:40 B     "abc"
 9  2147483647  NaN       "zzz"  TRUE  2017-08-21 18:26:40 B     "MISSING"
10          NA   NA       NA     NA    9999-12-31 23:59:59 B     "MISSING"
> Table$create(example_data_for_sorting) %>% mutate(
+   y = if_else(dbl > 5, chr, chr, missing = "MISSING")
+ ) %>% collect()
# A tibble: 10 x 7
           int        dbl chr    lgl   dttm                grp   y
         <int>      <dbl> <chr>  <lgl> <dttm>              <chr> <chr>
 1 -2147483647 -Inf       ""     FALSE 0000-01-01 00:00:00 A     ""
 2        -101 -1.80e+308 ""     FALSE 1919-05-29 13:08:55 A     ""
 3        -100 -2.23e-308 "\""   FALSE 1955-06-20 04:10:42 A     "\""
 4           0  0         "&"    FALSE 1973-06-30 11:38:41 A     "&"
 5           0  2.23e-308 "ABC"  TRUE  1987-03-29 12:49:47 A     "ABC"
 6           1  3.14e+  0 "NULL" TRUE  1991-06-11 19:07:01 B     "NULL"
 7         100  1.80e+308 "a"    TRUE  NA                  B     "a"
 8        1000  Inf       "abc"  TRUE  2017-08-21 18:26:40 B     "abc"
 9  2147483647  NaN       "zzz"  TRUE  2017-08-21 18:26:40 B     "zzz"
10          NA   NA       NA     NA    9999-12-31 23:59:59 B     "MISSING"

In the 9th row, NaN > 5 evaluates to NA in R and therefore gets the missing value, whereas in Arrow NaN > 5 evaluates to false, so we get "zzz" from the chr column.
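The difference can be reproduced in base R alone by emulating the Arrow comparison result described above. The emulation is an assumption made for illustration (it forces the NaN comparison to FALSE by hand), and `pick_with_missing()` is a hypothetical stand-in for `if_else(cond, chr, chr, missing = "MISSING")`.

```r
# Last three rows of the example: a finite value, NaN, and NA.
dbl <- c(1000, NaN, NA)
chr <- c("abc", "zzz", NA)

# Both branches are `chr`, so only the NA-ness of the condition matters.
pick_with_missing <- function(cond, values, missing) {
  out <- values
  out[is.na(cond)] <- missing
  out
}

# In R, comparing NaN yields NA.
r_cond <- dbl > 5

# Emulated Arrow result as described in this thread: NaN > 5 is false, NA stays NA.
arrow_like_cond <- r_cond
arrow_like_cond[is.nan(dbl)] <- FALSE

r_result <- pick_with_missing(r_cond, chr, "MISSING")
arrow_like_result <- pick_with_missing(arrow_like_cond, chr, "MISSING")

r_result           # "abc" "MISSING" "MISSING"
arrow_like_result  # "abc" "zzz"     "MISSING"
```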

Gosh, missing data is hard 🤔


nealrichardson

A couple of final notes but this generally is looking good, thanks for the work on it


github-actions bot added the Component: R label Jul 15, 2021

thisisnic marked this pull request as ready for review July 15, 2021 15:16

nealrichardson reviewed Jul 15, 2021

jonkeane force-pushed the ARROW-12964_ifelse branch from ee206d5 to 794c96b Compare July 15, 2021 20:22

jonkeane requested a review from nealrichardson July 15, 2021 20:47

ianmcook reviewed Jul 16, 2021

r/R/dplyr-functions.R (outdated, resolved)

thisisnic and others added 7 commits July 16, 2021 09:38

Add ifelse and if_else

fe6ac2e

Use build_expr not Expression$create

6b3a7aa

Add tests and warnings

613a1de

A few changes

1190697

A few more tests, slightly more guard rails for unimplemented types

cd7a2d9

ifelse -> if_else for generic tests

1ede692

take out errange ifelse from rebase

b6167fb

nealrichardson reviewed Jul 16, 2021

r/tests/testthat/test-dplyr.R (outdated, resolved)

Clean up, rebase

93f51a3

jonkeane force-pushed the ARROW-12964_ifelse branch from 3becca3 to 93f51a3 Compare July 16, 2021 15:33

jonkeane requested a review from nealrichardson July 16, 2021 15:33

jonkeane reviewed Jul 16, 2021

r/R/dplyr-functions.R (resolved)

jonkeane reviewed Jul 16, 2021

r/tests/testthat/helper-expectation.R (resolved)

nealrichardson reviewed Jul 16, 2021

r/R/dplyr-functions.R (outdated, resolved)

CR comments + add support for the missing arg (mostly)

6f67ecb

jonkeane reviewed Jul 16, 2021

ianmcook added 5 commits July 16, 2021 13:09

Make nse_funcs$is.*() type check functions work on R literals

afe21ea

Simplify type warning code in if_else()

8024723

Resolve merge conflict

8ed814b

Fix misspelling

e876926

Fix bug in is.() functions

f450b33

ianmcook and others added 3 commits July 16, 2021 16:18

Make is.na() and is.nan() consistent with base R

b6e8b3b

Improve comment

0b14faa

Use the new is.na() functionality + edit warning about factors/dicts

7384206

nealrichardson reviewed Jul 16, 2021

r/R/arrow-datum.R (outdated, resolved)

nealrichardson reviewed Jul 16, 2021

r/R/dplyr-functions.R (outdated, resolved)

nealrichardson reviewed Jul 16, 2021

r/R/dplyr-functions.R (outdated, resolved)

nealrichardson reviewed Jul 16, 2021

r/tests/testthat/test-dplyr.R (resolved)

ianmcook and others added 5 commits July 16, 2021 15:29

Call is_valid instead of !is_null

2c1b535

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

Call is_valid instead of !is_null

a6b5081

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

Add TODOs with Jira refs

6c28763

Fix indentation

0aaec22

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

Improve NEWS comment

52e6bc2

nealrichardson reviewed Jul 17, 2021

r/R/arrow-datum.R (outdated, resolved)

nealrichardson reviewed Jul 17, 2021

r/R/dplyr-functions.R (outdated, resolved)

nealrichardson reviewed Jul 17, 2021

r/tests/testthat/test-dplyr.R (outdated, resolved)

jonkeane and others added 4 commits July 17, 2021 09:46

better tests

3045697

Update r/R/dplyr-functions.R

7bac428

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

Update r/R/arrow-datum.R

4470e72

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

Merge branch 'ARROW-12964_ifelse' of github.com:thisisnic/arrow into …

075cb7f

…ARROW-12964_ifelse

ianmcook mentioned this pull request Jul 17, 2021

ARROW-13200: [R] Add binding for case_when() #10737

Closed

nealrichardson approved these changes Jul 17, 2021

nealrichardson closed this in 45a2ae9 Jul 17, 2021

lidavidm mentioned this pull request Aug 4, 2021

ARROW-13222: [C++] Improve type support for case_when #10806

Closed

lidavidm mentioned this pull request Sep 6, 2021

ARROW-13573: [C++] Support dictionaries natively in case_when #11022

Closed

This was referenced Aug 5, 2021

[R] Add bindings for ifelse() and if_else() #28684

Closed

[C++] Support dictionaries directly in case_when kernel #29220

Closed

thisisnic mentioned this pull request Sep 14, 2024

[R] Let na.rm of mean() support removing NaN as in base R #44089

Open


