Conditionally mutate selected rows #4050

krlmlr · 2018-12-21T15:47:39Z

Reusing the existing data when a column is overwritten would allow an efficient mutate_if_row() verb, here or elsewhere (assuming there's also a nice way to set the group data, as implemented in update_group_data() below). I remember a discussion about using the group data for other exciting things, such as bootstrapping?

In the example below, the first three rows should remain unchanged; currently they come back as NA.

library(tidyverse)

df <- tibble(a = 1:5)
df
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5

update_group_data <- function(.data, group_data) {
  attr(.data, "groups") <- group_data
  .data
}

group_filter <- function(.data, ...) {
  new_group_data <-
    .data %>%
    group_data() %>%
    filter(...)

  .data %>%
    update_group_data(new_group_data)
}

mutate_if_row <- function(.data, cond, ...) {
  cond <- rlang::enquo(cond)

  .data %>%
    group_by(.flag = !!cond) %>%
    group_filter(.flag) %>%
    mutate(...) %>%
    ungroup() %>%
    select(-.flag)
}

df %>%
  mutate_if_row(a > 3, a = a + 1L)
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1    NA
#> 2    NA
#> 3    NA
#> 4     5
#> 5     6

Created on 2018-12-21 by the reprex package (v0.2.1.9000)

krlmlr · 2018-12-21T15:53:50Z

We can fake it already, but overwriting would be a tad faster:

library(tidyverse)

df <- tibble(a = 1:5)
df
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5

if_flag <- function(quo, name) {
  rlang::quo_set_expr(
    quo,
    expr(if (.flag[1]) !!rlang::quo_get_expr(quo) else !!rlang::sym(name))
  )
}

mutate_if_row <- function(.data, cond, ...) {
  cond <- rlang::enquo(cond)
  quos <- rlang::quos(...)

  quos <- map2(quos, names(quos), if_flag)

  .data %>%
    group_by(.flag = !!cond) %>%
    mutate(!!!quos) %>%
    ungroup() %>%
    select(-.flag)
}

df %>%
  mutate_if_row(a > 3, a = a + 1L)
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     5
#> 5     6

Created on 2018-12-21 by the reprex package (v0.2.1.9000)

romainfrancois · 2018-12-21T16:02:23Z

Also not that trivial to implement. We can only realistically do that when R says the object has only one reference.

This, to me, looks like modify by reference, à la data.table, and is out of scope for dplyr.

This sounds like a use case for case_when:

library(dplyr)

df <- tibble(a = 1:5)

df %>%
  mutate(a = case_when(
    a > 3 ~ a + 1L, 
    TRUE  ~ a
  ))
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     5
#> 5     6

Created on 2018-12-21 by the reprex package (v0.2.1.9000)

krlmlr · 2018-12-21T16:21:17Z

mutate_if_row() is better: less noise, and it works for updating multiple columns at once. I've heard this question multiple times in workshops now.

I see your point: we need to copy anyway, even if R says the object has only one reference. Copying via memcpy() or R's duplication mechanism will still be faster.

Maybe something to consider for 0.9.0?

romainfrancois · 2019-01-29T11:02:10Z

I see, I think I've been confused by the group_by(). See also mutate_when() from https://gist.github.com/romainfrancois/eeeed972d6734bcad3ec3dcf872df7ea

library(rlang)
library(dplyr)
library(purrr)

mutate_when <- function(data, condition, ...){
  condition <- enquo(condition)
  
  dots <- exprs(...)
  
  expressions <- map2( dots, syms(names(dots)), ~{
    quo( case_when(..condition.. ~ !!.x , TRUE ~ !!.y ) )
  })
  
  data %>%
    mutate( ..condition.. = !!condition ) %>%
    mutate( !!!expressions ) %>%
    select( -..condition..)
}

d <- tibble( x = 1:4, y = 1:4)
mutate_when( d, x < 3, 
  x = -x, 
  y = -y
)
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3     3     3
#> 4     4     4

Created on 2019-01-29 by the reprex package (v0.2.1.9000)

romainfrancois · 2021-04-21T13:57:16Z

Here are some approaches using data frame returns:

"manually":

library(dplyr)
d <- tibble( x = 1:4, y = 1:4)

# using data frame returns
d %>% 
  mutate({
    test <- x < 4
    x[test] <- -x[test]
    y[test] <- -y[test]
    data.frame(x = x, y = y)
  })
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3    -3    -3
#> 4     4     4

If we want to do the same thing to a selected set of columns, we can use across() with a bit of code around it:

# using across()
d %>% 
  mutate({
    test <- x < 4
    across(c(x, y), ~ {.x[test] <- -.x[test]; .x })
  })
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3    -3    -3
#> 4     4     4

And we can abstract further, e.g.:

negate_if <- function(condition, cols) {
  across({{ cols }}, ~ {
    .x[condition] <- -.x[condition]
    .x
  })
}
d %>% 
  mutate(negate_if(x < 4, c(x, y)))
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3    -3    -3
#> 4     4     4

Now if we want to do arbitrary mutations, e.g. mutate_when(d, x < 4, x = -x, y = -y) we can do something like this, with some assumptions:

mutate_when <- function(.data, when, ...) {
  dots <- enquos(...)
  names <- names(dots)
  
  mutate(.data, {
    test <- {{ when }}
    
    changed <- data.frame(!!!dots)
    out <- across(all_of(names))
    # assuming `changed` and `out` have the same data frame type

    out[test, ] <- changed[test, ]
    out
  })
  
}
mutate_when(d, x < 4, x = -x, y = -y)
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3    -3    -3
#> 4     4     4

Created on 2021-04-21 by the reprex package (v0.3.0)

This all feels like things we can do with the tools available, perhaps in some other package?

k6adams · 2021-11-24T17:14:37Z

I just wanted to mention that mutate_when() is a function I would love to see incorporated. I posted this question on Stack Overflow, essentially asking if there was a simpler syntax for creating variables in a mutate() without a bunch of repetitive case_when() or ifelse() statements.

In my particular use case, I am writing code that creates output based on a flow chart. My intended end users are less familiar with R, and I don't want them to get overwhelmed by the sheer volume of repetitive code. IMO, this mutate_when() function is intuitive in conjunction with the pipe, %>%.

Because I am naive and new, I thought something like this might work...

data %>%
group_by(g1, g2) %>%
mutate(
   across(
      where(
         condition
      )
   ),
var1 = "happy",
var2 = var22 / var19 + 3,
var3 = ifelse(
   var2 >= 3,
   TRUE,
   FALSE
   ),
...more var statements...
 )

Thanks @romainfrancois for posting this function.

DavisVaughan · 2022-09-13T18:46:00Z

We gave a serious attempt at this in #6313 for dplyr 1.1.0, but ultimately decided not to add it in that release.

We aren't convinced that it is an operation that would be heavily used, as the main example usage we could come up with was replacing missing values, i.e.:

mutate(df, x = 0, .when = is.na(x))

We can't think of many examples beyond this one where this would be very useful.

Here are a few notes we should consider in the future when thinking about this:

Should groups be ignored when computing .when? To match SQL and data.table, it makes sense to ignore groups. I also can't think of any examples where a grouped application of .when makes sense. But this confused some people, especially because they might be passing in a grouped data frame, like group_by() %>% mutate(.when =). This becomes slightly less confusing in the context of .by, i.e. mutate(.when =, .by =), where we'd just document that .when is applied first.
To be performant, we have to hook this into the data mask. You have to evaluate .when to get the locations where ... should be applied, and then only the columns referenced in ... should be sliced to the locations referenced by .when. We did this successfully in #6313, but it required a decent chunk of refactoring.
Would you ever want more than 1 .when call in a single mutate()? Some people proposed an API of mutate(when(is.na(x), x = 0), when(y == 4, a = 5, b = 6)). I don't personally think this would be that useful. If we did this, we might also consider making sequential when() calls work in a case_when()-like fashion.
Should .when allow if_any() and if_all() in the expression? It seems like they might be useful as a way to compute a complex .when expression based on multiple columns, but this is somewhat hard to implement. We didn't do that in #6313.
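
The evaluate, slice, and assign steps in the notes above can be sketched outside the data mask. This is only an illustration under stated assumptions, not the implementation from #6313: mutate_where() is a hypothetical name, groups are dropped up front, only existing columns are updated, and every column is sliced rather than just the ones referenced in `...`.

```r
library(dplyr)
library(vctrs)

# Hypothetical helper for illustration; not the implementation from #6313.
mutate_where <- function(.data, .when, ...) {
  # Assumption: groups are ignored when computing the condition.
  .data <- ungroup(.data)

  # Evaluate the condition against the full data to get row locations.
  # which() drops both FALSE and NA.
  test <- mutate(.data, .test = {{ .when }})[[".test"]]
  loc <- which(test)

  # Run the expressions on the selected rows only.
  changed <- mutate(vec_slice(.data, loc), ...)

  # Write the updated rows back; vec_assign() casts the new values
  # to the original column types, so the result stays type stable.
  vec_assign(.data, loc, changed)
}

d <- tibble(x = c(1, NA, 3), y = 1:3)
mutate_where(d, is.na(x), x = 0)
```

Because the expressions run on the sliced rows, a summary such as mean(x) inside `...` would only see the selected rows, which is one of the design questions a real implementation would have to settle.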

We have to think about how useful this function is in light of the fact that we now have the ability to create type-stable case_when() and case_match() calls. For example, this handles the most common case of .when:

mutate(
  x = case_match(x, NA ~ 0, .ptype = x, .default = x)
)

And that could be wrapped into a replace_match(x, NA ~ 0) helper. Updating multiple columns based on 1 condition is also something if_else() can do now:

mutate(
  if_else(
    is.na(x) | is.na(y),
    tibble(x = 0, y = 0),
    tibble(x = x, y = y)
  )
)
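
A runnable version of the two snippets above might look like the following. It is a sketch assuming dplyr 1.1.0 or later; replace_match() is only proposed in this comment, so the definition here is a hypothetical wrapper, and the example data is made up.

```r
library(dplyr)

# Hypothetical helper: replace_match() is only proposed above, not part of dplyr.
replace_match <- function(x, ...) {
  # Unmatched values fall back to x, and the result keeps the type of x.
  case_match(x, ..., .default = x, .ptype = x)
}

df <- tibble(x = c(1, NA, 3), y = c(NA, 5, 6))

# One column: replace missing values, keep everything else.
one_col <- mutate(df, x = replace_match(x, NA ~ 0))

# Several columns from one condition, using data frame inputs to if_else().
# The unnamed data frame result is unpacked into columns x and y.
two_cols <- mutate(
  df,
  if_else(
    is.na(x) | is.na(y),
    tibble(x = 0, y = 0),
    tibble(x = x, y = y)
  )
)
```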

DavisVaughan · 2023-11-02T21:37:54Z

A nice little alternative to mutate(.when = ) we could consider. Slightly simpler than if_else or case_when equivalents, also type and size stable by default, and takes integer positions for i rather than just logical ones. Also supports data frames for x and value if you want to use 1 condition for multiple columns, like the example just above.

replace_at <- function(x, i, value) {
  size <- vctrs::vec_size(x)
  
  i <- vctrs::vec_as_location(i = i, n = size, missing = "remove")
  
  # recycle up to size of x
  value <- vctrs::vec_recycle(value, size, x_arg = "value")
  
  # slice down to locations selected by i
  value <- vctrs::vec_slice(value, i)
  
  vctrs::vec_assign(x, i, value)
}

library(dplyr)
library(nycflights13) # assumed source of `flights`

# with a vector the same size as x
mutate(
  flights,
  dep_delay = replace_at(dep_delay, dep_time > 500, -dep_delay)
)

# with a value
mutate(
  flights,
  dep_delay = replace_at(dep_delay, dep_time > 500, NA)
)

# at integer locations in x
mutate(
  flights,
  dep_delay = replace_at(dep_delay, c(5, 3), NA)
)
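
The data frame case mentioned above is not shown in the examples, so here is a minimal sketch of it. replace_at() is copied from this comment; the example data and the use of tibble(x, y) to bundle the columns are assumptions made for illustration.

```r
library(dplyr)
library(vctrs)

# Same definition as in the comment above.
replace_at <- function(x, i, value) {
  size <- vec_size(x)
  i <- vec_as_location(i = i, n = size, missing = "remove")
  value <- vec_recycle(value, size, x_arg = "value")
  value <- vec_slice(value, i)
  vec_assign(x, i, value)
}

d <- tibble(x = c(1, NA, 3), y = c(NA, 5, 6))

# One condition, two columns: x and value are both data frames.
# The unnamed data frame result is unpacked into columns x and y.
out <- mutate(
  d,
  replace_at(tibble(x, y), is.na(x) | is.na(y), tibble(x = 0, y = 0))
)
```

Rows 1 and 2 contain a missing value, so both x and y are set to 0 there, while row 3 is left as is.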

krlmlr · 2023-11-03T05:34:49Z

How about

mutate(flights, replace_at(dep_time > 500, dep_delay = -dep_delay))

with replace_at() returning a suitable data frame?

DavisVaughan · 2023-11-03T12:09:35Z

That can't be written as a standalone function IIUC. My hope was that we could figure out something that works outside of dplyr too

krlmlr · 2023-11-03T15:40:53Z

I'm thinking about something along the following lines:

options(conflicts.policy = list(warn = FALSE))
library(rlang)
library(vctrs)
library(tibble)
library(dplyr)
library(purrr)

replace_at <- function(where, ..., .envir = parent.frame()) {
  replacement <- tibble(...)

  orig_names <- names(replacement)
  orig_values <- as_tibble(map(set_names(orig_names), get0, .envir))

  vec_assign(orig_values, where, replacement)
}

foo <- 1:3
replace_at(2, foo = 5)
#> # A tibble: 3 × 1
#>     foo
#>   <int>
#> 1     1
#> 2     5
#> 3     3

tibble(foo) |>
  mutate(replace_at(2, foo = 5))
#> # A tibble: 3 × 1
#>     foo
#>   <int>
#> 1     1
#> 2     5
#> 3     3

Created on 2023-11-03 with reprex v2.0.2

thomasp85 · 2024-02-26T17:04:20Z

tidygraph now has a focus() verb that sort of does this

romainfrancois added the feature a feature request or enhancement label Dec 21, 2018

hadley changed the title ~~FR: When overwriting column, reuse existing data~~ Conditional mutate selected rows Dec 11, 2019

hadley added the verbs 🏃‍♀️ label Dec 11, 2019

hadley mentioned this issue Dec 12, 2019

Consider new row mutation functions #4654

Closed

hadley changed the title ~~Conditional mutate selected rows~~ Conditionally mutate selected rows Dec 31, 2019

romainfrancois mentioned this issue May 19, 2021

recode_when() tidyverse/funs#66

Open

krlmlr mentioned this issue Jun 24, 2021

Investigate rows_merge() #5179

Closed

This was referenced Aug 17, 2021

Support returning in rows_*.tbl_dbi() cynkra/dm#607

Merged

rows_update(): Conditional Update & Mutating Join cynkra/dm#609

Closed

romainfrancois added columns ↔️ Operations on columns: mutate(), select(), rename(), relocate() and removed verbs 🏃‍♀️ labels Oct 1, 2021

DavisVaughan mentioned this issue Jun 28, 2022

Implement mutate(.when =) #6313

Closed

1 task
