Conditionally mutate selected rows #4050

Open
krlmlr opened this issue Dec 21, 2018 · 12 comments
Labels
columns ↔️ Operations on columns: mutate(), select(), rename(), relocate() feature a feature request or enhancement

Comments

@krlmlr
Member

krlmlr commented Dec 21, 2018

This would allow supporting an efficient mutate_if_row() verb here or elsewhere (assuming there's also a nice way to set the group data, as implemented in update_group_data() here). I remember a discussion about using the group data for other exciting things such as bootstrapping?

In the example below, the first three rows should remain unchanged.

library(tidyverse)

df <- tibble(a = 1:5)
df
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5

update_group_data <- function(.data, group_data) {
  attr(.data, "groups") <- group_data
  .data
}

group_filter <- function(.data, ...) {
  new_group_data <-
    .data %>%
    group_data() %>%
    filter(...)

  .data %>%
    update_group_data(new_group_data)
}

mutate_if_row <- function(.data, cond, ...) {
  cond <- rlang::enquo(cond)

  .data %>%
    group_by(.flag = !!cond) %>%
    group_filter(.flag) %>%
    mutate(...) %>%
    ungroup() %>%
    select(-.flag)
}

df %>%
  mutate_if_row(a > 3, a = a + 1L)
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1    NA
#> 2    NA
#> 3    NA
#> 4     5
#> 5     6

Created on 2018-12-21 by the reprex package (v0.2.1.9000)
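For readers unfamiliar with the group data that `update_group_data()` overwrites, here is a minimal sketch of what `group_data()` returns for the example above. The variable names `gd` and `rows_true` are only illustrative. Dropping the `FALSE` row from this table appears to be why rows 1 to 3 come back as `NA`: they are no longer listed in any group, so `mutate()` never fills them.

```r
library(dplyr)

df <- tibble(a = 1:5)

# group_data() returns one row per group: the grouping columns plus a
# `.rows` list-column holding the row indices that belong to each group.
gd <- df %>%
  group_by(.flag = a > 3) %>%
  group_data()

# The grouping values, one per group
gd$.flag

# Row indices of the group where the condition is TRUE
rows_true <- as.integer(gd$.rows[[which(gd$.flag)]])
rows_true
```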

@romainfrancois romainfrancois added the feature a feature request or enhancement label Dec 21, 2018
@krlmlr
Member Author

krlmlr commented Dec 21, 2018

We can fake it already, but overwriting would be a tad faster:

library(tidyverse)

df <- tibble(a = 1:5)
df
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5

if_flag <- function(quo, name) {
  rlang::quo_set_expr(
    quo,
    expr(if (.flag[1]) !!rlang::quo_get_expr(quo) else !!rlang::sym(name))
  )
}

mutate_if_row <- function(.data, cond, ...) {
  cond <- rlang::enquo(cond)
  quos <- rlang::quos(...)

  quos <- map2(quos, names(quos), if_flag)

  .data %>%
    group_by(.flag = !!cond) %>%
    mutate(!!!quos) %>%
    ungroup() %>%
    select(-.flag)
}

df %>%
  mutate_if_row(a > 3, a = a + 1L)
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     5
#> 5     6

Created on 2018-12-21 by the reprex package (v0.2.1.9000)

@romainfrancois
Member

romainfrancois commented Dec 21, 2018

Also not that trivial to implement. We can only realistically do that when R says the object has only one reference.

This, to me, looks like modify by reference, à la data.table, and is out of scope for dplyr.

This sounds like a use case for case_when:

library(dplyr)

df <- tibble(a = 1:5)

df %>%
  mutate(a = case_when(
    a > 3 ~ a + 1L, 
    TRUE  ~ a
  ))
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     5
#> 5     6

Created on 2018-12-21 by the reprex package (v0.2.1.9000)

@krlmlr
Member Author

krlmlr commented Dec 21, 2018

mutate_if_row() is better, less noise and works for updating multiple columns at once. I've heard this question now multiple times in workshops.

I see your point: we need to copy anyway, even if R says the object has only one reference. Copying via memcpy() or R's duplication mechanism will still be faster.

Maybe something to consider for 0.9.0?

@romainfrancois
Member

I see, I think I was confused by the group_by(). See also mutate_when() from https://gist.github.com/romainfrancois/eeeed972d6734bcad3ec3dcf872df7ea

library(rlang)
library(dplyr)
library(purrr)

mutate_when <- function(data, condition, ...){
  condition <- enquo(condition)
  
  dots <- exprs(...)
  
  expressions <- map2( dots, syms(names(dots)), ~{
    quo( case_when(..condition.. ~ !!.x , TRUE ~ !!.y ) )
  })
  
  data %>%
    mutate( ..condition.. = !!condition ) %>%
    mutate( !!!expressions ) %>%
    select( -..condition..)
}

d <- tibble( x = 1:4, y = 1:4)
mutate_when( d, x < 3, 
  x = -x, 
  y = -y
)
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3     3     3
#> 4     4     4

Created on 2019-01-29 by the reprex package (v0.2.1.9000)

@hadley hadley changed the title FR: When overwriting column, reuse existing data Conditional mutate selected rows Dec 11, 2019
@hadley hadley changed the title Conditional mutate selected rows Conditionally mutate selected rows Dec 31, 2019
@romainfrancois
Member

Here are some approaches using data frame returns:

  • "manually":
library(dplyr)
d <- tibble( x = 1:4, y = 1:4)

# using data frame returns
d %>% 
  mutate({
    test <- x < 4
    x[test] <- -x[test]
    y[test] <- -y[test]
    data.frame(x = x, y = y)
  })
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3    -3    -3
#> 4     4     4

If we want to do the same thing to a selected set of columns, we can use across() with a bit of code around it:

# using across()
d %>% 
  mutate({
    test <- x < 4
    across(c(x, y), ~ {.x[test] <- -.x[test]; .x })
  })
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3    -3    -3
#> 4     4     4

And we can abstract this further, e.g.

negate_if <- function(condition, cols) {
  across({{ cols }}, ~ {
    .x[condition] <- -.x[condition]
    .x
  })
}
d %>% 
  mutate(negate_if(x < 4, c(x, y)))
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3    -3    -3
#> 4     4     4

Now if we want to do arbitrary mutations, e.g. mutate_when(d, x < 4, x = -x, y = -y) we can do something like this, with some assumptions:

mutate_when <- function(.data, when, ...) {
  dots <- enquos(...)
  names <- names(dots)
  
  mutate(.data, {
    test <- {{ when }}
    
    changed <- data.frame(!!!dots)
    out <- across(all_of(names))
    # assuming `changed` and `out` have the same data frame type

    out[test, ] <- changed[test, ]
    out
  })
  
}
mutate_when(d, x < 4, x = -x, y = -y)
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3    -3    -3
#> 4     4     4

Created on 2021-04-21 by the reprex package (v0.3.0)

These all feel like things we can do with the tools already available, perhaps in some other package?

@k6adams

k6adams commented Nov 24, 2021

I just wanted to mention that mutate_when() is a function I would love to see incorporated. I posted this question on Stack Overflow, essentially asking whether there is a simpler syntax for creating variables in a mutate() without a bunch of repetitive case_when() or ifelse() statements.

In my particular use case, I am creating code which creates output based upon a flow chart. My intended end users are less familiar with R, and I don't want them to get overwhelmed by the sheer volume of repetitive code. IMO, this mutate_when() function is intuitive in conjunction with pipe %>%.

Because I am naive and new, I thought something like this might work...

data %>%
group_by(g1, g2) %>%
mutate(
   across(
      where(
         condition
      )
   ),
var1 = "happy",
var2 = var22 / var19 + 3,
var3 = ifelse(
   var2 >= 3,
   TRUE,
   FALSE
   ),
...more var statements...
 )

Thanks @romainfrancois for posting this function.

@DavisVaughan
Member

We gave a serious attempt at this in #6313 for dplyr 1.1.0, but ultimately decided not to add it in that release.

We aren't convinced that it is an operation that would be heavily used, as the main example usage we could come up with was replacing missing values, i.e.:

mutate(df, x = 0, .when = is.na(x))

We can't think of many examples beyond this one where this would be very useful.
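Since `.when` was not added, the missing-value case above has to be written with existing functions. A minimal sketch on a made-up tibble (`df`, `res_if_else` and `res_coalesce` are illustrative names, not part of the proposal):

```r
library(dplyr)

df <- tibble(x = c(1, NA, 3), y = c("a", "b", "c"))

# Replace missing x with 0 and leave every other row untouched,
# which is what mutate(df, x = 0, .when = is.na(x)) was meant to do.
res_if_else <- mutate(df, x = if_else(is.na(x), 0, x))

# coalesce() expresses the same replacement more compactly.
res_coalesce <- mutate(df, x = coalesce(x, 0))

res_if_else
```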

Here are a few notes we should consider in the future when thinking about this:

  • Should groups be ignored when computing .when? To match SQL and data.table, it makes sense to ignore groups. I also can't think of any examples where a grouped application of .when makes sense. But this confused some people, especially because they might be passing a grouped data frame in, like group_by() %>% mutate(.when =). This becomes slightly less confusing in the context of .by, i.e. mutate(.when =, .by = ), where we'd just document that .when is applied first.

  • To be performant, we have to hook this into the data mask. You have to evaluate .when to get the locations where ... should be applied, and then only the columns referenced in ... should be sliced to the locations referenced by .when. We did this successfully in #6313 ("Implement mutate(.when =)"), but it required a decent chunk of refactoring.

  • Would you ever want more than 1 .when call in a single mutate()? Some people proposed an API of mutate(when(is.na(x), x = 0), when(y == 4, a = 5, b = 6)). I don't personally think this would be that useful. If we did this, we might also consider making sequential when() calls work in a case-when like fashion.

  • Should .when allow if_any() and if_all() in the expression? It seems like they might be useful as a way to compute a complex when expression based on multiple columns, but this is somewhat hard to implement. We didn't do that in #6313.
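To make the first note concrete, here is a small sketch of how the same condition selects different rows depending on whether it is evaluated per group or over the whole data frame. It uses `if_else()` as a stand-in, since `.when` does not exist in released dplyr; the data and the names `ungrouped` and `grouped` are made up for illustration.

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b", "b"), x = c(1, 3, 10, 30))

# Condition evaluated over the whole data frame: mean(x) is 11,
# so only the last row satisfies x > mean(x).
ungrouped <- df %>%
  mutate(x = if_else(x > mean(x), 0, x))

# Condition evaluated per group: the means are 2 and 20,
# so one row in each group satisfies x > mean(x).
grouped <- df %>%
  group_by(g) %>%
  mutate(x = if_else(x > mean(x), 0, x)) %>%
  ungroup()

ungrouped$x
grouped$x
```

Whichever behaviour a `.when` argument picked, users expecting the other one would silently get different rows updated, which is the confusion described above.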

We have to think about how useful this function is in light of the fact that we now have the ability to create type stable case_when() and case_match() calls. i.e. this handles the most common case of .when:

mutate(
  x = case_match(x, NA ~ 0, .ptype = x, .default = x)
)

And that could be wrapped into a replace_match(x, NA ~ 0) helper. Updating multiple columns based on 1 condition is also something if_else() can do now:
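A rough sketch of what such a helper might look like, assuming dplyr 1.1.0 or later for `case_match()`. `replace_match()` is hypothetical here: it is not an exported dplyr function, and this is only one possible way to write it.

```r
library(dplyr)

# Hypothetical helper: forward the formulas to case_match() and fall back
# to `x` itself, pinning the output type to the type of `x`.
replace_match <- function(x, ...) {
  case_match(x, ..., .default = x, .ptype = x)
}

x <- c(1, NA, 3)
replaced <- replace_match(x, NA ~ 0)
replaced
```

Because `.ptype = x` fixes the output type, a replacement value that cannot be cast to the type of `x` should raise an error instead of silently changing the column type.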

mutate(
  if_else(
    is.na(x) | is.na(y),
    tibble(x = 0, y = 0),
    tibble(x = x, y = y)
  )
)

@DavisVaughan
Member

A nice little alternative to mutate(.when = ) we could consider. Slightly simpler than if_else or case_when equivalents, also type and size stable by default, and takes integer positions for i rather than just logical ones. Also supports data frames for x and value if you want to use 1 condition for multiple columns, like the example just above.

replace_at <- function(x, i, value) {
  size <- vctrs::vec_size(x)
  
  i <- vctrs::vec_as_location(i = i, n = size, missing = "remove")
  
  # recycle up to size of x
  value <- vctrs::vec_recycle(value, size, x_arg = "value")
  
  # slice down to locations selected by i
  value <- vctrs::vec_slice(value, i)
  
  vctrs::vec_assign(x, i, value)
}

# with a vector the same size as x
mutate(
  flights,
  dep_delay = replace_at(dep_delay, dep_time > 500, -dep_delay)
)

# with a value
mutate(
  flights,
  dep_delay = replace_at(dep_delay, dep_time > 500, NA)
)

# at integer locations in x
mutate(
  flights,
  dep_delay = replace_at(dep_delay, c(5, 3), NA)
)
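The calls above assume the `flights` data from nycflights13. As a self-contained check, the following sketch repeats the `replace_at()` definition from this comment and applies it to a small made-up tibble, including the data frame case mentioned in the description. It assumes a vctrs version in which `vec_as_location()` accepts `missing = "remove"`; `df`, `one` and `both` are illustrative names.

```r
library(dplyr)
library(vctrs)

# Copied from the comment above so that this block runs on its own.
replace_at <- function(x, i, value) {
  size <- vec_size(x)
  i <- vec_as_location(i = i, n = size, missing = "remove")
  value <- vec_recycle(value, size, x_arg = "value")
  value <- vec_slice(value, i)
  vec_assign(x, i, value)
}

df <- tibble(x = c(1, NA, 3), y = c(NA, 2, 3))

# One condition, one column
one <- mutate(df, x = replace_at(x, is.na(x), 0))

# One condition, several columns: x and value are both data frames, and
# the unnamed data frame result is unpacked into columns by mutate().
both <- mutate(
  df,
  replace_at(tibble(x, y), is.na(x) | is.na(y), tibble(x = 0, y = 0))
)

one
both
```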

@krlmlr
Member Author

krlmlr commented Nov 3, 2023

How about

mutate(flights, replace_at(dep_time > 500, dep_delay = -dep_delay))

with replace_at() returning a suitable data frame?

@DavisVaughan
Member

That can't be written as a standalone function, IIUC. My hope was that we could figure out something that works outside of dplyr too.

@krlmlr
Member Author

krlmlr commented Nov 3, 2023

I'm thinking about something along the following lines:

options(conflicts.policy = list(warn = FALSE))
library(rlang)
library(vctrs)
library(tibble)
library(dplyr)
library(purrr)

replace_at <- function(where, ..., .envir = parent.frame()) {
  replacement <- tibble(...)

  orig_names <- names(replacement)
  orig_values <- as_tibble(map(set_names(orig_names), get0, .envir))

  vec_assign(orig_values, where, replacement)
}

foo <- 1:3
replace_at(2, foo = 5)
#> # A tibble: 3 × 1
#>     foo
#>   <int>
#> 1     1
#> 2     5
#> 3     3

tibble(foo) |>
  mutate(replace_at(2, foo = 5))
#> # A tibble: 3 × 1
#>     foo
#>   <int>
#> 1     1
#> 2     5
#> 3     3

Created on 2023-11-03 with reprex v2.0.2

@thomasp85
Member

tidygraph now has a focus() verb that sort of does this.


6 participants