Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12964: [R] Add bindings for ifelse() and if_else() #10724

Closed
wants to merge 26 commits into from
Closed
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
fe6ac2e
Add ifelse and if_else
thisisnic Jul 15, 2021
6b3a7aa
Use build_expr not Expression$create
thisisnic Jul 15, 2021
613a1de
Add tests and warnings
thisisnic Jul 15, 2021
1190697
A few changes
jonkeane Jul 15, 2021
cd7a2d9
A few more tests, slightly more guard rails for unimplemented types
jonkeane Jul 15, 2021
1ede692
ifelse -> if_else for generic tests
jonkeane Jul 15, 2021
b6167fb
take out errange ifelse from rebase
jonkeane Jul 15, 2021
93f51a3
Clean up, rebase
jonkeane Jul 16, 2021
6f67ecb
CR comments + add support for the missing arg (mostly)
jonkeane Jul 16, 2021
afe21ea
Make nse_funcs$is.*() type check functions work on R literals
ianmcook Jul 16, 2021
8024723
Simplify type warning code in if_else()
ianmcook Jul 16, 2021
8ed814b
Resolve merge conflict
ianmcook Jul 16, 2021
e876926
Fix misspelling
ianmcook Jul 16, 2021
f450b33
Fix bug in is.() functions
ianmcook Jul 16, 2021
b6e8b3b
Make is.na() and is.nan() consistent with base R
ianmcook Jul 16, 2021
0b14faa
Improve comment
ianmcook Jul 16, 2021
7384206
Use the new is.na() functionality + edit warning about factors/dicts
jonkeane Jul 16, 2021
2c1b535
Call is_valid instead of !is_null
ianmcook Jul 16, 2021
a6b5081
Call is_valid instead of !is_null
ianmcook Jul 16, 2021
6c28763
Add TODOs with Jira refs
ianmcook Jul 16, 2021
0aaec22
Fix indentation
ianmcook Jul 16, 2021
52e6bc2
Improve NEWS comment
ianmcook Jul 16, 2021
3045697
better tests
jonkeane Jul 17, 2021
7bac428
Update r/R/dplyr-functions.R
jonkeane Jul 17, 2021
4470e72
Update r/R/arrow-datum.R
jonkeane Jul 17, 2021
075cb7f
Merge branch 'ARROW-12964_ifelse' of github.com:thisisnic/arrow into …
jonkeane Jul 17, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
47 changes: 38 additions & 9 deletions r/R/dplyr-functions.R
Original file line number Diff line number Diff line change
Expand Up @@ -294,8 +294,8 @@ nse_funcs$substr <- function(x, start, stop) {
msg = "`stop` must be length 1 - other lengths are not supported in Arrow"
)

# substr treats values as if they're on a continous number line, so values
# 0 are effectively blank characters - set `start` to 1 here so Arrow mimics
# substr treats values as if they're on a continous number line, so values
# 0 are effectively blank characters - set `start` to 1 here so Arrow mimics
# this behavior
if (start <= 0) {
start <- 1
Expand All @@ -310,7 +310,7 @@ nse_funcs$substr <- function(x, start, stop) {
Expression$create(
"utf8_slice_codeunits",
x,
# we don't need to subtract 1 from `stop` as C++ counts exclusively
# we don't need to subtract 1 from `stop` as C++ counts exclusively
# which effectively cancels out the difference in indexing between R & C++
options = list(start = start - 1L, stop = stop)
)
Expand All @@ -336,14 +336,14 @@ nse_funcs$str_sub <- function(string, start = 1L, end = -1L) {
end <- .Machine$integer.max
}

# An end value lower than a start value returns an empty string in
# An end value lower than a start value returns an empty string in
# stringr::str_sub so set end to 0 here to match this behavior
if (end < start) {
end <- 0
}

# subtract 1 from `start` because C++ is 0-based and R is 1-based
# str_sub treats a `start` value of 0 or 1 as the same thing so don't subtract 1 when `start` == 0
# str_sub treats a `start` value of 0 or 1 as the same thing so don't subtract 1 when `start` == 0
# when `start` < 0, both str_sub and utf8_slice_codeunits count backwards from the end
if (start > 0) {
start <- start - 1L
Expand Down Expand Up @@ -634,20 +634,49 @@ nse_funcs$wday <- function(x, label = FALSE, abbr = TRUE, week_start = getOption
}

nse_funcs$log <- function(x, base = exp(1)) {

if (base == exp(1)) {
return(Expression$create("ln_checked", x))
}

if (base == 2) {
return(Expression$create("log2_checked", x))
}

if (base == 10) {
return(Expression$create("log10_checked", x))
}
}
# ARROW-13345
stop("`base` values other than exp(1), 2 and 10 not supported in Arrow", call. = FALSE)
}

nse_funcs$logb <- nse_funcs$log

nse_funcs$if_else <- function(condition, true, false, missing = NULL){
if (!is.null(missing)) {
return(nse_funcs$if_else(
Expression$create("is_null", condition),
missing,
nse_funcs$if_else(condition, true, false)
))
}

# TODO: if_else doesn't yet support factors/dictionaries this can be removed when
# ARROW-13358 merges
warn_r_types <- is.factor(true) || is.factor(false)
# However, if they are expressions, we need to use the functions from nse_funcs
warn_expression_types_true <- inherits(true, "Expression") && nse_funcs$is.factor(true)
warn_expression_types_false <- inherits(false, "Expression") && nse_funcs$is.factor(false)

if (warn_r_types | warn_expression_types_true | warn_expression_types_false) {
warning("Factors are currently converted to characters in if_else and ifelse", call. = FALSE)
}

build_expr("if_else", condition, true, false)
}

# Although base R ifelse allows `yes` and `no` to be different classes
#
nse_funcs$ifelse <- function(test, yes, no){
nse_funcs$if_else(condition = test, true = yes, false = no)
}
3 changes: 2 additions & 1 deletion r/tests/testthat/helper-expectation.R
Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,8 @@ expect_dplyr_equal <- function(expr,

if (isTRUE(warning)) {
# Special-case the simple warning:
warning <- "not supported in Arrow; pulling data into R"
# TODO: ARROW-13362 pick one of in or by and use it everywhere
warning <- "not supported (in|by) Arrow; pulling data into R"
}

skip_msg <- NULL
Expand Down
154 changes: 134 additions & 20 deletions r/tests/testthat/test-dplyr.R
Original file line number Diff line number Diff line change
Expand Up @@ -67,7 +67,7 @@ chr: string
See $.data for the source Arrow object',
fixed = TRUE
)

})

test_that("summarize", {
Expand Down Expand Up @@ -821,7 +821,7 @@ test_that("type checks on expressions", {
collect(),
tbl
)

# the code in the expectation below depends on RE2
skip_if_not_available("re2")

Expand Down Expand Up @@ -947,64 +947,64 @@ test_that("abs()", {
})

test_that("log functions", {

df <- tibble(x = c(1:10, NA, NA))

expect_dplyr_equal(
input %>%
mutate(y = log(x)) %>%
collect(),
df
)

expect_dplyr_equal(
input %>%
mutate(y = log(x, base = exp(1))) %>%
collect(),
df
)

expect_dplyr_equal(
input %>%
mutate(y = log(x, base = 2)) %>%
collect(),
df
)

expect_dplyr_equal(
input %>%
mutate(y = log(x, base = 10)) %>%
collect(),
df
)

expect_error(
nse_funcs$log(Expression$scalar(x), base = 5),
"`base` values other than exp(1), 2 and 10 not supported in Arrow",
fixed = TRUE
)

expect_dplyr_equal(
input %>%
mutate(y = logb(x)) %>%
collect(),
df
)

expect_dplyr_equal(
input %>%
mutate(y = log1p(x)) %>%
collect(),
df
)

expect_dplyr_equal(
input %>%
mutate(y = log2(x)) %>%
collect(),
df
)

expect_dplyr_equal(
input %>%
mutate(y = log10(x)) %>%
Expand All @@ -1013,44 +1013,158 @@ test_that("log functions", {
)

})

test_that("trig functions", {

df <- tibble(x = c(seq(from = 0, to = 1, by = 0.1), NA))

expect_dplyr_equal(
input %>%
mutate(y = sin(x)) %>%
collect(),
df
)

expect_dplyr_equal(
input %>%
mutate(y = cos(x)) %>%
collect(),
df
)

expect_dplyr_equal(
input %>%
mutate(y = tan(x)) %>%
collect(),
df
)

expect_dplyr_equal(
input %>%
mutate(y = asin(x)) %>%
collect(),
df
)

expect_dplyr_equal(
input %>%
mutate(y = acos(x)) %>%
collect(),
df
)

})
})

test_that("if_else and ifelse", {
expect_dplyr_equal(
input %>%
mutate(
y = if_else(int > 5, 1, 0)
) %>% collect(),
example_data
)

expect_dplyr_equal(
input %>%
mutate(
y = if_else(int > 5, int, 0L)
) %>% collect(),
example_data
)

expect_error(
Table$create(example_data) %>%
mutate(
y = if_else(int > 5, 1, FALSE)
) %>% collect(),
'NotImplemented: Function if_else has no kernel matching input types'
)

expect_dplyr_equal(
input %>%
mutate(
y = if_else(int > 5, 1, NA_real_)
) %>% collect(),
example_data
)

expect_dplyr_equal(
input %>%
mutate(
y = ifelse(int > 5, 1, 0)
) %>% collect(),
example_data
)

expect_dplyr_equal(
input %>%
mutate(
y = if_else(dbl > 5, TRUE, FALSE)
) %>% collect(),
example_data
)

expect_dplyr_equal(
input %>%
mutate(
y = if_else(chr %in% letters[1:3], 1L, 3L)
) %>% collect(),
example_data
)

expect_dplyr_equal(
input %>%
mutate(
y = if_else(int > 5, "one", "zero")
) %>% collect(),
example_data
)

expect_dplyr_equal(
input %>%
mutate(
y = if_else(int > 5, chr, chr)
) %>% collect(),
example_data
)

expect_dplyr_equal(
input %>%
mutate(
y = if_else(int > 5, chr, chr, missing = "MISSING")
) %>% collect(),
example_data
)

# TODO: remove the mutate + warning after ARROW-13358 is merged and Arrow
# supports factors in if(_)else
expect_dplyr_equal(
input %>%
mutate(
y = if_else(int > 5, fct, factor("a"))
) %>% collect() %>%
# This is a no-op on the Arrow side, but necesary to make the results equal
mutate(y = as.character(y)),
example_data,
warning = "Factors are currently converted to characters in if_else and ifelse"
)

skip("ARROW-12055 for better NaN support")
# currently NaNs are not NAs and so the missing argument is not correctly
# applied
expect_dplyr_equal(
input %>%
mutate(
y = if_else(dbl > 5, chr, chr, missing = "MISSING")
) %>% collect(),
example_data_for_sorting
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This skip is not great, I could probably write up a hacky helper that basically checks to see if the condition is.na and if it's not then only if the type is double do is.nan(). And maybe I should do that already so that we don't silently differ from expectations here...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://issues.apache.org/jira/browse/ARROW-12055 is the "make NaN actually work" ticket

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But dbl doesn't have NaN does it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARROW-12055 is now resolved by this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this skip now be removed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately as it turns out, no. I've created https://issues.apache.org/jira/browse/ARROW-13364 to track this, but Arrow's comparison with NaNs results in false and not an NA(-like) value:

> example_data_for_sorting %>% mutate(
+     y = if_else(dbl > 5, chr, chr, missing = "MISSING")
+ ) %>% collect()
# A tibble: 10 x 7
           int          dbl chr    lgl   dttm                grp   y        
         <int>        <dbl> <chr>  <lgl> <dttm>              <chr> <chr>    
 1 -2147483647 -Inf         ""     FALSE 0000-01-01 00:00:00 A     ""       
 2        -101   -1.80e+308 ""     FALSE 1919-05-29 13:08:55 A     ""       
 3        -100   -2.23e-308 "\""   FALSE 1955-06-20 04:10:42 A     "\""     
 4           0    0         "&"    FALSE 1973-06-30 11:38:41 A     "&"      
 5           0    2.23e-308 "ABC"  TRUE  1987-03-29 12:49:47 A     "ABC"    
 6           1    3.14e+  0 "NULL" TRUE  1991-06-11 19:07:01 B     "NULL"   
 7         100    1.80e+308 "a"    TRUE  NA                  B     "a"      
 8        1000  Inf         "abc"  TRUE  2017-08-21 18:26:40 B     "abc"    
 9  2147483647  NaN         "zzz"  TRUE  2017-08-21 18:26:40 B     "MISSING"
10          NA   NA          NA    NA    9999-12-31 23:59:59 B     "MISSING"
> Table$create(example_data_for_sorting) %>% mutate(
+     y = if_else(dbl > 5, chr, chr, missing = "MISSING")
+ ) %>% collect()
# A tibble: 10 x 7
           int          dbl chr    lgl   dttm                grp   y        
         <int>        <dbl> <chr>  <lgl> <dttm>              <chr> <chr>    
 1 -2147483647 -Inf         ""     FALSE 0000-01-01 00:00:00 A     ""       
 2        -101   -1.80e+308 ""     FALSE 1919-05-29 13:08:55 A     ""       
 3        -100   -2.23e-308 "\""   FALSE 1955-06-20 04:10:42 A     "\""     
 4           0    0         "&"    FALSE 1973-06-30 11:38:41 A     "&"      
 5           0    2.23e-308 "ABC"  TRUE  1987-03-29 12:49:47 A     "ABC"    
 6           1    3.14e+  0 "NULL" TRUE  1991-06-11 19:07:01 B     "NULL"   
 7         100    1.80e+308 "a"    TRUE  NA                  B     "a"      
 8        1000  Inf         "abc"  TRUE  2017-08-21 18:26:40 B     "abc"    
 9  2147483647  NaN         "zzz"  TRUE  2017-08-21 18:26:40 B     "zzz"    
10          NA   NA          NA    NA    9999-12-31 23:59:59 B     "MISSING"

That 9th row NaN > 5 is evaluated to NA in R and therefore gets a missing value, where as in Arrow NaN > 5 evaluates to false so we get the "zzz" from the chr column

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gosh, missing data is hard 🤔


skip("TODO: could? should? we support the autocasting in ifelse")
expect_dplyr_equal(
input %>%
mutate(y = ifelse(int > 5, 1, FALSE)) %>%
collect(),
example_data
)
})